AWS Portfolio
[Diagram: data feeds, databases, and logs land in Amazon S3 as the landing zone / data lake; Amazon EC2 and Amazon EMR handle ETL; Amazon Redshift serves as the DWH, with Amazon DynamoDB, Amazon RDS, and Amazon Glacier as additional stores.]
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-program-pipeline.html
[Diagram: logs flow into Amazon S3 as the data lake / landing zone, Amazon EMR acts as the ETL grid, and Amazon Redshift is the production DWH feeding visualization.]
Define
aws datapipeline create-pipeline --name myETL --unique-id token
Output: df-09222142C63VXJU0HC0A
Import
aws datapipeline put-pipeline-definition --pipeline-id df-09222142C63VXJU0HC0A --pipeline-definition file:///home/repo/etl_reinvent.json
Activate
aws datapipeline activate-pipeline --pipeline-id df-09222142C63VXJU0HC0A
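The same define / import / activate flow can also be scripted against the Data Pipeline API (see the dp-program-pipeline link above). A minimal sketch using boto3; the Default object, ShellCommandActivity, worker group, and IAM role names are illustrative assumptions, not the contents of etl_reinvent.json:

import boto3

dp = boto3.client("datapipeline", region_name="us-east-1")

# Define: returns the pipeline id (e.g. df-09222142C63VXJU0HC0A)
pipeline_id = dp.create_pipeline(name="myETL", uniqueId="token")["pipelineId"]

# Import: push a minimal definition (API "fields" format)
dp.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=[
        {"id": "Default", "name": "Default", "fields": [
            {"key": "scheduleType", "stringValue": "ondemand"},
            {"key": "role", "stringValue": "DataPipelineDefaultRole"},
            {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
        ]},
        {"id": "HelloActivity", "name": "HelloActivity", "fields": [
            {"key": "type", "stringValue": "ShellCommandActivity"},
            {"key": "command", "stringValue": "echo hello"},
            # assumes a Task Runner is polling this worker group
            {"key": "workerGroup", "stringValue": "myWorkerGroup"},
        ]},
    ],
)

# Activate: start the pipeline
dp.activate_pipeline(pipelineId=pipeline_id)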
copy weblogs
from 's3://staging/'
credentials 'aws_access_key_id=<my access key>;aws_secret_access_key=<my secret key>'
delimiter ',';
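For context, a COPY like the one above is what a Redshift load step ultimately executes. A minimal sketch of issuing it from Python with psycopg2; the cluster endpoint, database, and credentials are placeholders:

import psycopg2

# Connect to the Amazon Redshift cluster (placeholder endpoint and credentials)
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="warehouse", user="etl_user", password="secret",
)

copy_sql = """
    copy weblogs
    from 's3://staging/'
    credentials 'aws_access_key_id=<my access key>;aws_secret_access_key=<my secret key>'
    delimiter ',';
"""

# The connection context manager commits on success, rolls back on error
with conn, conn.cursor() as cur:
    cur.execute(copy_sql)   # Redshift pulls the files directly from Amazon S3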
Coursera Big Data Analytics Powered by AWS
Thomas Barthelemy
SWE, Data Infrastructure
[email protected]
Overview
● About Coursera
● Phase 1: Consolidate data
● Phase 2: Get users hooked
● Phase 3: Increase reliability
● Looking forward
Coursera at a Glance
About Coursera
● Platform for Massive Open Online Courses
● Universities create the content
● Content free to the public
Coursera Stats
● ~10 million learners
● 110+ university partners
● >200 courses open now
● ~170 employees
The value of data at Coursera
● Making strategic and tactical decisions
● Studying pedagogy
Becoming More Data-Driven
● Since the early days, Coursera has understood the value of data
o Founders came from machine learning
o Many of the early employees researched with the founders
● Cost of data access was high
o Each analysis required extraction and pre-processing
o Data was only available to data scientists and engineers
Phase 1:
Consolidate data
Sources
● MySQL
o Site data
o Course data sharded across multiple databases
● Cassandra increasingly used for course data
● Logged event data
● External APIs
(Obligatory meme)
What platforms to use?
● Amazon Redshift had glowing recommendations
● AWS Data Pipeline has native support for various Amazon services
ETL development was slow :(
● Slow to do the following in the console:
o create one pipeline
o create similar pipelines
o update existing pipelines
Solution: Programmatically create pipelines
● Break ETL into reusable steps
o Extract from a variety of sources
o Transform data with Amazon EMR, Amazon EC2, or within Amazon Redshift
o Load data principally into Amazon Redshift
● Use Amazon S3 as intermediate state for Amazon EMR- or Amazon EC2-based transformations
● Map steps to set of AWS Data Pipeline objects
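A rough sketch of that mapping: each reusable step type gets a builder that emits the AWS Data Pipeline objects it needs, and a YAML definition is expanded step by step. The builder and function names here are assumptions for illustration, not Coursera's ETL library:

import yaml  # pip install pyyaml

def build_extract_from_rds(step, step_id):
    # A real builder would emit the data nodes and activity sketched on the next slide
    return []

def build_load_into_staging_table(step, step_id):
    # A real builder would emit a RedshiftDataNode and a RedshiftCopyActivity
    return []

STEP_BUILDERS = {
    "extract-from-rds": build_extract_from_rds,
    "load-into-staging-table": build_load_into_staging_table,
}

def objects_from_yaml(path):
    """Expand a YAML step list into a flat list of pipeline objects."""
    steps = yaml.safe_load(open(path))["steps"]
    objects = []
    for i, step in enumerate(steps):
        objects += STEP_BUILDERS[step["type"]](step, "step-%d" % i)
    return objects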
Example step: extract-rds
● Parameters: hostname, database name, SQL
● Creates pipeline objects
o The S3 node can be used by many other step types
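For illustration, the kind of objects an extract-rds step could expand into, written here as Python dicts in the pipeline-definition format. The object ids, S3 path, and the MySqlDataNode/CopyActivity combination are assumptions, not Coursera's exact output:

# Hypothetical expansion of one extract-rds step
extract_rds_objects = [
    {"id": "SourceRDS", "type": "MySqlDataNode",
     "connectionString": "jdbc:mysql://maestro-read-replica:3306/maestro",
     "username": "etl", "*password": "<secret>",
     "table": "courses_instructorincourse",
     "selectQuery": "SELECT instructor_id, course_id, rank FROM courses_instructorincourse;"},
    {"id": "ExtractStage", "type": "S3DataNode",
     "directoryPath": "s3://etl-intermediate/extract-rds/"},   # the reusable S3 node
    {"id": "ExtractCopy", "type": "CopyActivity",
     "input": {"ref": "SourceRDS"}, "output": {"ref": "ExtractStage"},
     "runsOn": {"ref": "Ec2Instance"}},   # an Ec2Resource defined elsewhere in the pipeline
]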
Example load definition
steps:
- type: extract-from-rds            # Extract data from Amazon RDS
  sql: |
    SELECT instructor_id, course_id, rank
    FROM courses_instructorincourse;
  hostname: maestro-read-replica
  database: maestro
- type: load-into-staging-table     # Load intermediate table in Amazon Redshift
  table: staging.maestro_instructors_sessions
- type: reload-prod-table           # Reload target table with new data
  source: staging.maestro_instructors_sessions
  destination: prod.instructors_sessions
ETL – Amazon RDS
[Diagram: extract from SQL (Amazon RDS) into Amazon S3, then load into Amazon Redshift.]
ETL – Sharded RDS
[Diagram: extract the shards into Amazon S3, transform with Amazon EMR back into Amazon S3, then load into Amazon Redshift.]
ETL – Logs
[Diagram: extract logs into Amazon S3, transform with Amazon EMR into Amazon S3, then load into Amazon Redshift.]
Reporting Model, Dec 2013
Reporting Model, Sep 2014
AWS Data Pipeline
● Easily handles starting/stopping of resources
● Handles permissions, roles
● Integrates with other AWS services
● Handles “flow” of data, data dependencies
Dealing with large pipelines
● Monolithic pipelines hard to maintain
● Moved to making pipelines smaller
o Hooray modularity!
● If pipeline B depended on pipeline A, just schedule it later
o Add a time buffer just to be safe
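A sketch of what "schedule it later with a buffer" can look like in the two pipeline definitions; the dates, period, and three-hour buffer are placeholders:

# Pipeline A's schedule
schedule_a = {"id": "DailySchedule", "type": "Schedule",
              "period": "1 day", "startDateTime": "2014-11-12T02:00:00"}

# Pipeline B's schedule: same period, started three hours later as a safety buffer
schedule_b = {"id": "DailySchedule", "type": "Schedule",
              "period": "1 day", "startDateTime": "2014-11-12T05:00:00"}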
Setting cross-pipeline dependencies
● Dependencies accomplished using a script that waits until the upstream pipelines finish
o ShellCommandActivity to the rescue
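One way such a wait script can work, sketched below: the last step of the upstream pipeline writes a marker object to Amazon S3, and the downstream pipeline's first ShellCommandActivity polls for it. The marker convention, bucket, and timeout are assumptions, not Coursera's implementation:

# wait_for_dependency.py -- run via ShellCommandActivity as the first activity
import sys
import time
import boto3
from botocore.exceptions import ClientError

def wait_for_marker(bucket, key, timeout_s=4 * 3600, poll_s=60):
    """Block until s3://bucket/key exists (written by the upstream pipeline)."""
    s3 = boto3.client("s3")
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            s3.head_object(Bucket=bucket, Key=key)
            return True
        except ClientError:
            time.sleep(poll_s)   # marker not there yet; keep polling
    return False

if __name__ == "__main__":
    bucket, key = sys.argv[1], sys.argv[2]
    if not wait_for_marker(bucket, key):
        sys.exit(1)   # fail the activity so the missed dependency is visible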
The beauty of ShellCommandActivity
● You can use it anywhere
o Accomplish tasks that have no corresponding activity type
o Override native AWS Data Pipeline support if it does not meet your needs
ETL library
● Install on the machine as the first step of each pipeline
o With ShellCommandActivity
● Allows for even more modularity
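For illustration, that install step could be a ShellCommandActivity placed first in every pipeline; the package name and resource id below are placeholders:

# Hypothetical "install the ETL library" object (pipeline-definition format)
install_library = {
    "id": "InstallEtlLibrary",
    "type": "ShellCommandActivity",
    "command": "pip install --upgrade my-etl-library",   # placeholder package
    "runsOn": {"ref": "Ec2Instance"},
}
# Later activities can add "dependsOn": {"ref": "InstallEtlLibrary"}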
Phase 2:
Get users hooked
We have data. Now what?
● Simply collecting data will not make a company data-driven
● First step: make data easier for the analysts to use
Certifying a version of the truth
● Data Infrastructure team creates a 3NF model of the data
● Model is designed to be as interpretable as possible
Example: Forums
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-concepts-schedules.html
http://bit.ly/awsevals