AWS Portfolio
[Diagram: data feeds, databases, and logs land in Amazon S3 as the landing zone / data lake; Amazon EC2 and Amazon EMR handle ETL; Amazon Redshift serves as the DWH, with Amazon DynamoDB, Amazon RDS, and Amazon Glacier as additional stores.]
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-program-pipeline.html
[Diagram: logs flow into Amazon S3 as the data lake / landing zone, Amazon EMR acts as the ETL grid, and Amazon Redshift is the production DWH feeding visualization.]
Define
aws datapipeline create-pipeline --name myETL --unique-id token
Output: df-09222142C63VXJU0HC0A
Import
aws datapipeline put-pipeline-definition --pipeline-id df-09222142C63VXJU0HC0A --pipeline-definition file:///home/repo/etl_reinvent.json
Activate
aws datapipeline activate-pipeline --pipeline-id df-09222142C63VXJU0HC0A
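The same define / import / activate flow can also be scripted against the Data Pipeline API (see the dp-program-pipeline link above). A minimal sketch using boto3; the Default object, ShellCommandActivity, worker group, and IAM role names are illustrative assumptions, not the contents of etl_reinvent.json:

import boto3

dp = boto3.client("datapipeline", region_name="us-east-1")

# Define: returns the pipeline id (e.g. df-09222142C63VXJU0HC0A)
pipeline_id = dp.create_pipeline(name="myETL", uniqueId="token")["pipelineId"]

# Import: push a minimal definition (API "fields" format)
dp.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=[
        {"id": "Default", "name": "Default", "fields": [
            {"key": "scheduleType", "stringValue": "ondemand"},
            {"key": "role", "stringValue": "DataPipelineDefaultRole"},
            {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
        ]},
        {"id": "HelloActivity", "name": "HelloActivity", "fields": [
            {"key": "type", "stringValue": "ShellCommandActivity"},
            {"key": "command", "stringValue": "echo hello"},
            # assumes a Task Runner is polling this worker group
            {"key": "workerGroup", "stringValue": "myWorkerGroup"},
        ]},
    ],
)

# Activate: start the pipeline
dp.activate_pipeline(pipelineId=pipeline_id)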
copy weblogs
from 's3://staging/'
credentials 'aws_access_key_id=<my access key>;aws_secret_access_key=<my secret key>'
delimiter ',';
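For context, a COPY like the one above is what a Redshift load step ultimately executes. A minimal sketch of issuing it from Python with psycopg2; the cluster endpoint, database, and credentials are placeholders:

import psycopg2

# Connect to the Amazon Redshift cluster (placeholder endpoint and credentials)
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="warehouse", user="etl_user", password="secret",
)

copy_sql = """
    copy weblogs
    from 's3://staging/'
    credentials 'aws_access_key_id=<my access key>;aws_secret_access_key=<my secret key>'
    delimiter ',';
"""

# The connection context manager commits on success, rolls back on error
with conn, conn.cursor() as cur:
    cur.execute(copy_sql)   # Redshift pulls the files directly from Amazon S3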
Coursera Big Data Analytics Powered by AWS
Thomas Barthelemy
SWE, Data Infrastructure
[email protected]
Overview
● About Coursera
● Phase 1: Consolidate data
● Phase 2: Get users hooked
● Phase 3: Increase reliability
● Looking forward
Coursera at a Glance
About Coursera
● Platform for Massive Open Online Courses
● Universities create the content
● Content free to the public
Coursera Stats
● ~10 million learners
● 110+ university partners
● >200 courses open now
● ~170 employees
The value of data at Coursera
● Making strategic and tactical decisions
● Studying pedagogy
Becoming More Data-Driven
● Since the early days, Coursera has understood the value of data
o Founders came from machine learning
o Many of the early employees researched with the founders
● Cost of data access was high
o Each analysis required extraction and pre-processing
o Data was only available to data scientists and engineers
Phase 1:
Consolidate data
Sources
● MySQL
o Site data
o Course data sharded across multiple databases
● Cassandra increasingly used for course data
● Logged event data
● External APIs
(Obligatory meme)
What platforms to use?
● Amazon Redshift had glowing recommendations
● AWS Data Pipeline has native support for various Amazon services
ETL development was slow :(
● Slow to do the following in the console:
o create one pipeline
o create similar pipelines
o update existing pipelines
Solution: Programmatically create pipelines
● Break ETL into reusable steps
o Extract from a variety of sources
o Transform data with Amazon EMR, Amazon EC2, or within Amazon Redshift
o Load data principally into Amazon Redshift
● Use Amazon S3 as intermediate state for Amazon EMR- or Amazon EC2-based transformations
● Map steps to set of AWS Data Pipeline objects
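A rough sketch of that mapping: each reusable step type gets a builder that emits the AWS Data Pipeline objects it needs, and a YAML definition is expanded step by step. The builder and function names here are assumptions for illustration, not Coursera's ETL library:

import yaml  # pip install pyyaml

def build_extract_from_rds(step, step_id):
    # A real builder would emit the data nodes and activity sketched on the next slide
    return []

def build_load_into_staging_table(step, step_id):
    # A real builder would emit a RedshiftDataNode and a RedshiftCopyActivity
    return []

STEP_BUILDERS = {
    "extract-from-rds": build_extract_from_rds,
    "load-into-staging-table": build_load_into_staging_table,
}

def objects_from_yaml(path):
    """Expand a YAML step list into a flat list of pipeline objects."""
    steps = yaml.safe_load(open(path))["steps"]
    objects = []
    for i, step in enumerate(steps):
        objects += STEP_BUILDERS[step["type"]](step, "step-%d" % i)
    return objects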
Example step: extract-rds
● Parameters: hostname, database name, SQL
● Creates pipeline objects
o The S3 node can be used by many other step types
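For illustration, the kind of objects an extract-rds step could expand into, written here as Python dicts in the pipeline-definition format. The object ids, S3 path, and the MySqlDataNode/CopyActivity combination are assumptions, not Coursera's exact output:

# Hypothetical expansion of one extract-rds step
extract_rds_objects = [
    {"id": "SourceRDS", "type": "MySqlDataNode",
     "connectionString": "jdbc:mysql://maestro-read-replica:3306/maestro",
     "username": "etl", "*password": "<secret>",
     "table": "courses_instructorincourse",
     "selectQuery": "SELECT instructor_id, course_id, rank FROM courses_instructorincourse;"},
    {"id": "ExtractStage", "type": "S3DataNode",
     "directoryPath": "s3://etl-intermediate/extract-rds/"},   # the reusable S3 node
    {"id": "ExtractCopy", "type": "CopyActivity",
     "input": {"ref": "SourceRDS"}, "output": {"ref": "ExtractStage"},
     "runsOn": {"ref": "Ec2Instance"}},   # an Ec2Resource defined elsewhere in the pipeline
]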
Example load definition
steps:
- type: extract-from-rds            # Extract data from Amazon RDS
  sql: |
    SELECT instructor_id, course_id, rank
    FROM courses_instructorincourse;
  hostname: maestro-read-replica
  database: maestro
- type: load-into-staging-table     # Load intermediate table in Amazon Redshift
  table: staging.maestro_instructors_sessions
- type: reload-prod-table           # Reload target table with new data
  source: staging.maestro_instructors_sessions
  destination: prod.instructors_sessions
ETL – Amazon RDS
[Diagram: extract from SQL (Amazon RDS) into Amazon S3, then load into Amazon Redshift.]
ETL – Sharded RDS
[Diagram: extract the shards into Amazon S3, transform with Amazon EMR back into Amazon S3, then load into Amazon Redshift.]
ETL – Logs
[Diagram: extract logs into Amazon S3, transform with Amazon EMR into Amazon S3, then load into Amazon Redshift.]
Reporting Model, Dec 2013
Reporting Model, Sep 2014
AWS Data Pipeline
● Easily handles starting/stopping of resources
● Handles permissions, roles
● Integrates with other AWS services
● Handles “flow” of data, data dependencies
Dealing with large pipelines
● Monolithic pipelines hard to maintain
● Moved to making pipelines smaller
o Hooray modularity!
● If pipeline B depended on pipeline A, just schedule it later
o Add a time buffer just to be safe
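A sketch of what "schedule it later with a buffer" can look like in the two pipeline definitions; the dates, period, and three-hour buffer are placeholders:

# Pipeline A's schedule
schedule_a = {"id": "DailySchedule", "type": "Schedule",
              "period": "1 day", "startDateTime": "2014-11-12T02:00:00"}

# Pipeline B's schedule: same period, started three hours later as a safety buffer
schedule_b = {"id": "DailySchedule", "type": "Schedule",
              "period": "1 day", "startDateTime": "2014-11-12T05:00:00"}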
Setting cross-pipeline dependencies
● Dependencies accomplished using a script that waits until the upstream pipelines finish
o ShellCommandActivity to the rescue
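One way such a wait script can work, sketched below: the last step of the upstream pipeline writes a marker object to Amazon S3, and the downstream pipeline's first ShellCommandActivity polls for it. The marker convention, bucket, and timeout are assumptions, not Coursera's implementation:

# wait_for_dependency.py -- run via ShellCommandActivity as the first activity
import sys
import time
import boto3
from botocore.exceptions import ClientError

def wait_for_marker(bucket, key, timeout_s=4 * 3600, poll_s=60):
    """Block until s3://bucket/key exists (written by the upstream pipeline)."""
    s3 = boto3.client("s3")
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            s3.head_object(Bucket=bucket, Key=key)
            return True
        except ClientError:
            time.sleep(poll_s)   # marker not there yet; keep polling
    return False

if __name__ == "__main__":
    bucket, key = sys.argv[1], sys.argv[2]
    if not wait_for_marker(bucket, key):
        sys.exit(1)   # fail the activity so the missed dependency is visible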
The beauty of ShellCommandActivity
● You can use it anywhere
o Accomplish tasks that have no corresponding activity type
o Override native AWS Data Pipeline support if it does not meet your needs
ETL library
● Install on the machine as the first step of each pipeline
o With ShellCommandActivity
● Allows for even more modularity
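For illustration, that install step could be a ShellCommandActivity placed first in every pipeline; the package name and resource id below are placeholders:

# Hypothetical "install the ETL library" object (pipeline-definition format)
install_library = {
    "id": "InstallEtlLibrary",
    "type": "ShellCommandActivity",
    "command": "pip install --upgrade my-etl-library",   # placeholder package
    "runsOn": {"ref": "Ec2Instance"},
}
# Later activities can add "dependsOn": {"ref": "InstallEtlLibrary"}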
Phase 2:
Get users hooked
We have data. Now what?
● Simply collecting data will not make a company data-driven
● First step: make data easier for the analysts to use
Certifying a version of the truth
● Data Infrastructure team creates a 3NF model of the data
● Model is designed to be as interpretable as possible
Example: Forums
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-concepts-schedules.html
http://bit.ly/awsevals