ARC213-R

Architecture patterns for multi-region active-active

Girish Dilip Patil, Senior Solutions Architect, Amazon Web Services
Jonathan Dion, Senior Technical Evangelist, Amazon Web Services
Thomas Jackson, Head of Core & Data Infrastructure, Wish

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Agenda

Why do you need it?

Design principles

Foundational pillars

A customer’s journey: Wish’s path to multi-region


Everything fails, all the time

Werner Vogels
CTO Amazon.com
Guarding against failure of your applications in one region

[Diagram: applications in Canada (Services 1-4) serve users from Canada; applications in Mumbai (Services 1-4) serve users from India]

We need to reduce the blast radius of adverse events
What is the problem with conventional DR solutions?

DR environments that aren’t used

1. Fall out of sync, eventually

2. Waste money
Advantage of multi-region active-active architecture

Serving geographically distributed customer base

[Diagram: Regions and the users they serve]
Canada: users from Canada
Mumbai: users from India
Ohio: users from USA
Sydney: users from Australia
Tolerance for network partitioning
1. Any problem in one region should not lead to failure of applications in another
2. You should aim for regional independence for request serving
   o Minimal blocking API or database calls from one region to another
   o Graceful degradation of service in case network connectivity is lost

[Diagram: Region A and Region B connected over the backbone]
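
Graceful degradation could look roughly like the sketch below. It is only an illustration: `DegradingClient` and `fetch_remote` are hypothetical names, and the "local copy" is an in-memory dict standing in for whatever local store the application has.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class DegradingClient:
    """Serve from a remote Region when possible, fall back to a local copy otherwise."""

    def __init__(self, fetch_remote: Callable[[str], str]) -> None:
        self._fetch_remote = fetch_remote  # hypothetical cross-region call
        self._local_copy: Dict[str, Tuple[str, float]] = {}

    def get(self, key: str) -> Tuple[Optional[str], bool]:
        """Return (value, is_fresh). Never raises on a cross-region failure."""
        try:
            value = self._fetch_remote(key)
        except (ConnectionError, TimeoutError):
            cached = self._local_copy.get(key)
            # Degraded mode: a possibly stale local value, or None if nothing is cached.
            return (cached[0] if cached else None, False)
        self._local_copy[key] = (value, time.time())
        return value, True
```

The caller decides what to do with a stale or missing value (hide a widget, show a default), so a partition between Regions does not turn into a failed request.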
Minimal data replication requirements

• Does all data need to be replicated?

• If yes, does it need to be replicated synchronously?

• Does all data need to be replicated continuously?


Data classification

Class                    Example
Transactions             Purchase record
Catalogue information    Product details
Events, objects          Click stream
Server logs              HTTPS logs

Ordered from low volume but highly critical (transactions) to high volume but less critical (server logs)
Synchronous vs. asynchronous replication modes

[Diagram, both modes: App -> Write -> DB (Region A) -> Replicate -> DB (Region B), with an Ack. for the write and an Ack. for the replication]

Sync
  Pros: Guarantees consistency
  Cons: Network & target dependent

Async
  Pros: Network & target independent
  Cons: Two databases can go out of sync
Ideal replication system
Should
• Report replication lag
• Report record offset
• Be able to retry replication of failed records (try until successful)

[Diagram: Source DB -> Replicator -> Target DB; replication lag and record offset feed high-level metrics monitoring]
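
A minimal sketch of such a replicator, under simplifying assumptions: the source is an in-memory list of `(commit_time, payload)` records, and `apply_to_target` is a hypothetical callable that raises `ConnectionError` when the target Region is unreachable. None of these names come from an AWS service.

```python
import time
from typing import Callable, List, Optional, Tuple

Record = Tuple[float, str]  # (commit time at the source, payload)


class Replicator:
    """Ships records from a source log to a target, in order, retrying failures."""

    def __init__(self, source_log: List[Record],
                 apply_to_target: Callable[[str], None]) -> None:
        self.source_log = source_log
        self.apply_to_target = apply_to_target  # may raise ConnectionError
        self.offset = 0                          # index of the next record to ship

    def replication_lag(self, now: Optional[float] = None) -> float:
        """Age in seconds of the oldest record not yet replicated (0.0 if caught up)."""
        if self.offset >= len(self.source_log):
            return 0.0
        now = time.time() if now is None else now
        return max(0.0, now - self.source_log[self.offset][0])

    def run_once(self, retry_delay: float = 0.0) -> None:
        """Replicate everything pending; a failed record is retried until it succeeds."""
        while self.offset < len(self.source_log):
            _, payload = self.source_log[self.offset]
            try:
                self.apply_to_target(payload)
            except ConnectionError:
                time.sleep(retry_delay)  # try again; never skip a record
                continue
            self.offset += 1
```

`offset` and `replication_lag()` are the two numbers to export to monitoring. Note that `run_once` blocks for as long as the target is down, which is the "try until successful" behavior; a real system would alert on lag rather than give up.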
Pattern 1: Read local, write global

[Diagram: users in India read and write in the Mumbai Region; users in Canada read in the Canada Central Region and write to Mumbai. Each Region runs a web server and App1-App3 servers, with snapshots and AMIs (web, app, database) synchronized between Regions. The database master is in Mumbai and the database replica is in Canada Central, kept current by database synchronization.]
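
As a rough illustration of the routing rule in this pattern, the sketch below uses plain dicts as stand-ins for the databases. `ReadLocalWriteGlobal` and `synchronize` are hypothetical names, not part of any AWS SDK.

```python
from typing import Dict, Optional


class ReadLocalWriteGlobal:
    """Route reads to the local Region's store and writes to the single master."""

    def __init__(self, local_region: str, master_region: str,
                 stores: Dict[str, Dict[str, str]]) -> None:
        self.local_region = local_region
        self.master_region = master_region
        self.stores = stores  # region name -> in-memory stand-in for a database

    def read(self, key: str) -> Optional[str]:
        # Reads never leave the Region, so they may lag behind the master.
        return self.stores[self.local_region].get(key)

    def write(self, key: str, value: str) -> None:
        # Writes always go to the master, wherever the caller is.
        self.stores[self.master_region][key] = value


def synchronize(stores: Dict[str, Dict[str, str]], master_region: str) -> None:
    """Stand-in for database synchronization: copy the master to every replica."""
    for region, store in stores.items():
        if region != master_region:
            store.update(stores[master_region])
```

A client in Canada that writes and immediately reads may not see its own write until synchronization catches up, which is the trade-off this pattern accepts.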
Pattern 2: Read local, write local
Multi-master, multi-region

[Diagram: users in India read and write in the Mumbai Region; users in Canada read and write in the Canada Central Region. Each Region runs a web server and App1-App3 servers, with snapshots and AMIs (web, app, database). Both Regions hold a database master, kept current by database synchronization.]
Distributed system design best practices

• Eventual consistency
• Idempotency
• Static stability
• Exponential backoff
• Throttling
• Circuit breaking

AWS Well-Architected Framework
Reliability Pillar whitepaper
http://bit.ly/31u0UbP
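
One of these practices, exponential backoff, can be sketched as follows. This is a generic illustration with "full jitter" (a random delay up to the current cap); the function name and default values are assumptions, not taken from the whitepaper.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(operation: Callable[[], T], max_attempts: int = 5,
                      base_delay: float = 0.1, max_delay: float = 5.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry `operation` on ConnectionError with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0.0, cap))  # jitter spreads out retries
    raise ValueError("max_attempts must be at least 1")
```

Retries are only safe when the operation is idempotent, which is why the two practices appear together.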
Foundational pillars of a multi-region active-active architecture

High availability
Region topology

AWS Region: A Region is a physical location in the world where we have multiple Availability Zones
AWS Availability Zone (AZ): Availability Zones consist of one or more discrete data centers, each with redundant power, networking, and connectivity, housed in separate facilities

[Diagram: a Region with elements labelled Transit, AZ, and Data center]
[Map: Regions and their number of Availability Zones; announced Regions: Cape Town, Jakarta, Milan, and Spain]
Single-region high-availability approach
Leverage multiple Availability Zones (AZs)

[Diagram: one VPC in the Ohio Region spanning Availability Zones A, B, and C, each with its own instances and database]
Examples of Multi-AZ services

By default Multi-AZ: Amazon S3, Amazon EFS, Amazon DynamoDB, Amazon QLDB, Amazon Kinesis, Amazon SQS

Configurable for Multi-AZ deployment: Amazon Elasticsearch Service, Amazon RDS, Amazon Aurora, Amazon Neptune, Amazon DocumentDB, Amazon ElastiCache, Amazon Managed Streaming for Kafka, Amazon MQ



Foundational pillars of a multi-region active-active architecture

High availability | Data replication
S3 cross-region replication

[Diagram: a bucket in Region A replicated to a bucket in Region B over the backbone]

• Automatically replicate data to any other AWS Region
• Replicate by object, bucket, or prefix
• Replication time control
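
For illustration, a replication rule with a prefix filter and replication time control might be described as below. The helper `build_replication_config` is hypothetical and only builds a dict; the dict follows the shape that boto3's `put_bucket_replication` accepts as `ReplicationConfiguration`. The ARNs and rule ID are placeholders.

```python
from typing import Any, Dict


def build_replication_config(role_arn: str, destination_bucket_arn: str,
                             prefix: str = "") -> Dict[str, Any]:
    """Build a replication configuration with replication time control and metrics."""
    return {
        "Role": role_arn,
        "Rules": [{
            "ID": "replicate-" + (prefix or "all"),
            "Priority": 1,
            "Status": "Enabled",
            "Filter": {"Prefix": prefix},  # "" matches the whole bucket
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {
                "Bucket": destination_bucket_arn,
                # Replication time control
                "ReplicationTime": {"Status": "Enabled", "Time": {"Minutes": 15}},
                # Replication metrics
                "Metrics": {"Status": "Enabled", "EventThreshold": {"Minutes": 15}},
            },
        }],
    }

# Applying it (not executed here; requires boto3, credentials, and versioning
# enabled on both buckets):
#   boto3.client("s3").put_bucket_replication(
#       Bucket="source-bucket", ReplicationConfiguration=config)
```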
S3 replication metrics

• BytesPendingReplication
• ReplicationLatency
• OperationsPendingReplication
Amazon Elastic Block Store snapshots

• Point-in-time backup
• Stored in S3
• Incremental
• Cross-region copy

[Diagram: EBS volume -> EBS snapshot]
DynamoDB Global Tables

[Diagram: replicas in Europe, the US, and Asia]
Global Tables management
Amazon CloudWatch metrics
• ReplicationLatency: Elapsed time to propagate an item from one replica to another
• PendingReplicationCount: Number of items written to one replica but not yet propagated to other regions
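
An alarm on `ReplicationLatency` could be described as in the sketch below. The helper name, alarm name, and threshold are assumptions chosen for illustration; the returned dict uses the keyword arguments of boto3's CloudWatch `put_metric_alarm`.

```python
from typing import Any, Dict


def replication_latency_alarm(table_name: str, receiving_region: str,
                              threshold_ms: float = 5000.0) -> Dict[str, Any]:
    """Keyword arguments for an alarm on a global table's ReplicationLatency."""
    return {
        "AlarmName": f"{table_name}-replication-latency-{receiving_region}",
        "Namespace": "AWS/DynamoDB",
        "MetricName": "ReplicationLatency",
        "Dimensions": [
            {"Name": "TableName", "Value": table_name},
            {"Name": "ReceivingRegion", "Value": receiving_region},
        ],
        "Statistic": "Average",
        "Period": 60,              # seconds per datapoint
        "EvaluationPeriods": 5,    # alarm after 5 consecutive breaches
        "Threshold": threshold_ms, # the metric is reported in milliseconds
        "ComparisonOperator": "GreaterThanThreshold",
    }

# Applying it (not executed here; requires boto3 and credentials):
#   boto3.client("cloudwatch").put_metric_alarm(**replication_latency_alarm(...))
```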
Amazon RDS cross-region replication

[Diagram: one master with four cross-region replicas]
Multi-region consolidation of analytics data

[Diagram]
US West: Amazon Kinesis Data Streams -> AWS Lambda -> (to US East)
US East: Amazon Kinesis Data Streams -> AWS Lambda -> Amazon Kinesis Data Firehose -> Amazon Redshift and Amazon Elasticsearch Service
Foundational pillars of a multi-region active-active architecture

High availability | Data replication | Networking
Amazon VPC and software VPN

[Diagram: a VPC in Canada Central and a VPC in Mumbai, each with App1-App3 servers, a database, a software VPN appliance, and an internet gateway (IGW); the two VPN appliances connect over the internet]
AWS Global Infrastructure

AWS global network
• Redundant 100 GbE network
• Private network capacity between all AWS Regions, except China

The AWS Cloud spans:
• 200 points of presence
• 69 Availability Zones
• 22 geographic Regions around the world

*With announced plans for 13 more Availability Zones and four more Regions in Cape Town, Jakarta, Milan, and Spain
Inter-Region VPC peering

[Diagram: a VPC in Canada Central and a VPC in Mumbai, each with App1-App3 servers and a database, connected by VPC peering over the AWS backbone]
Foundational pillars of a multi-region active-active architecture

High availability | Data replication | Networking | Traffic routing
Traffic routing with Amazon Route 53
• Latency-based routing
• Geolocation routing
• DNS failover

[Diagram: a user in Canada queries Amazon Route 53, which answers with Resource A in Canada Central or Resource B in Mumbai]
*Latency numbers are only examples
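
The three routing policies can be illustrated with a small selection function. This is a toy model of the decision logic only, not how Route 53 is configured or implemented; `Endpoint`, the latency table, and the geo map are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Endpoint:
    region: str
    healthy: bool
    latency_ms: Dict[str, float]  # measured latency from each client location


def latency_based(endpoints: List[Endpoint], client: str) -> Optional[Endpoint]:
    """Pick the healthy endpoint with the lowest latency for this client."""
    candidates = [e for e in endpoints if e.healthy and client in e.latency_ms]
    return min(candidates, key=lambda e: e.latency_ms[client], default=None)


def geolocation(endpoints: List[Endpoint], client: str,
                geo_map: Dict[str, str], default_region: str) -> Optional[Endpoint]:
    """Pick the Region mapped to the client's location, failing over if it is unhealthy."""
    by_region = {e.region: e for e in endpoints}
    preferred = by_region.get(geo_map.get(client, default_region))
    if preferred is not None and preferred.healthy:
        return preferred
    # DNS failover: fall back to any other healthy endpoint, nearest first.
    return latency_based(endpoints, client)
```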
Traffic routing with AWS Global Accelerator

[Diagram: clients at 1.2.3.4, 2.3.4.5, and 3.4.5.6 all connect to the static IPs 54.86.52.59 and 52.45.82.211]

• All clients point to the same static IPs and are directed to the closest PoP
• Global Accelerator endpoint with anycast IP, e.g., 54.86.52.59, 52.45.82.211
• Global Accelerator chooses the optimal AWS Region based on the geography of the client
DNS-based solutions vs. Global Accelerator

DNS-based traffic management solutions
• Client devices can cache DNS answers for a long time
• Hard to know when users will have the updated IP addresses after a backend failure or change in routing preferences

Global Accelerator
• No reliance on IP address caching of client devices
• Reduced downtime (change propagation in seconds)
• Static IP addresses: no need to update DNS or clients when moving endpoints across Regions/AZs
Traffic routing with Amazon CloudFront + Lambda@Edge

[Diagram: a user from Canada reaches a CloudFront distribution at an AWS edge location; Lambda@Edge modifies the origin, choosing between instances in Canada Central and instances in Mumbai]
Traffic routing with Amazon CloudFront + Lambda@Edge

[Diagram: a user from India reaches a CloudFront distribution at an AWS edge location; Lambda@Edge modifies the origin, choosing between instances in Canada Central and instances in Mumbai]
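
A "modify origin" function might look like the sketch below, written as an origin-request trigger. The origin domain names and the country-to-Region mapping are made up for illustration, and the sketch assumes a custom (non-S3) origin and that the `CloudFront-Viewer-Country` header is forwarded to the function.

```python
from typing import Any, Dict

# Hypothetical origin domain names per Region, and a country-to-Region mapping.
ORIGINS = {
    "ca-central-1": "canada.origin.example.com",
    "ap-south-1": "mumbai.origin.example.com",
}
COUNTRY_TO_REGION = {"CA": "ca-central-1", "IN": "ap-south-1"}
DEFAULT_REGION = "ca-central-1"


def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Origin-request trigger: point the request at the origin nearest the viewer."""
    request = event["Records"][0]["cf"]["request"]
    headers = request["headers"]

    # Present only if CloudFront is configured to forward this header.
    country_header = headers.get("cloudfront-viewer-country", [])
    country = country_header[0]["value"] if country_header else ""

    domain = ORIGINS[COUNTRY_TO_REGION.get(country, DEFAULT_REGION)]
    request["origin"]["custom"]["domainName"] = domain
    headers["host"] = [{"key": "Host", "value": domain}]  # Host must match the origin
    return request
```

Viewers from countries not in the mapping fall through to `DEFAULT_REGION`.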
Foundational pillars of a multi-region active-active architecture

High availability | Data replication | Networking | Traffic routing | Management
Management of multi-region deployment

Management area             AWS service
Security and compliance     AWS Config Rules
Automation and inventory    AWS Systems Manager
Monitoring and logs         Amazon CloudWatch
Resources provisioning      AWS CloudFormation StackSets


AWS CloudFormation StackSets
Provision resources across multiple AWS accounts and regions
https://amzn.to/2tNVHQl

[Diagram: a StackSet in the admin account creates a stack in target accounts A and B in both Canada Central and Mumbai]
Multi-Region Architectures
Why, what, and how
Me: Thomas Jackson

● Head of Core & Data Infrastructure @ Wish

● Work experience:
○ Network Engineer
○ Corporate IT
○ Small Startups
○ Freelance Work
○ LinkedIn (professional social network)
○ Wish (mobile-first ecommerce platform)
About Us

Who We Are
Leading mobile commerce platform in US and EU.

Our Mission
To offer the most affordable, convenient, and effective mobile shopping mall in the world.
Global Reach

500M+ Users
1M+ Merchants
200M+ Items
Wish (the company)

600+ Employees
7 Offices
$11.2B Valuation
High-Level Architecture (before)

● Hybrid cloud
○ Single AWS Region
○ Multiple data centers with backbone/DX
● Read-heavy application
○ ~90%+ reads
○ Globally sharded+replicated database

[Diagram: one Region in the AWS Cloud connected over the backbone to on-premises data centers DC1-DC4]
High-Level Architecture (before)

● Hard AZ split
○ AZ is already the AWS failure unit
○ “Simple”: self-contained AZ (mostly)
○ Avoid cross-AZ network transfer costs
■ Some (such as DB replication) are required

[Diagram: one Region with two Availability Zones, connected over the backbone to on-premises data centers DC1-DC4]
Cross-AZ transfer costs

● What?
○ Service fan-out means at each hop we have a chance for cross-AZ traffic

[Charts: bandwidth total (Gbps); bandwidth per AZ (Gbps)]
High-Level Architecture (before)

● Monitoring
○ Prometheus: scraping, storage, alerting
○ Promxy: aggregation, alerting
○ Trickster: caching
High-Level Monitoring Architecture (before)

[Diagram: Promxy in front of two Availability Zones]

More details: bit.ly/2Lilbf0


Why?

Why change anything?


Why?

● Scalability
○ ICE: InsufficientInstanceCapacity
● The cloud is elastic, until it isn’t
○ An issue in “crunch” times
○ An instance type might become “hot” in one or more AZs
○ Mitigate by using many instance types
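
Using many instance types could be sketched as a fallback loop like the one below. `launch` is a hypothetical callable (for example a thin wrapper around an EC2 launch call) that raises the stand-in exception `InsufficientInstanceCapacity` when a type is "hot" in an AZ; this is not Wish's actual tooling.

```python
from typing import Callable, Iterable, List, Tuple


class InsufficientInstanceCapacity(Exception):
    """Stand-in for the provider's capacity error (ICE)."""


def launch_with_fallback(launch: Callable[[str, str], str],
                         instance_types: Iterable[str],
                         zones: Iterable[str]) -> Tuple[str, str, str]:
    """Try each (instance type, AZ) pair in preference order until one launches."""
    zones = list(zones)
    tried: List[Tuple[str, str]] = []
    for instance_type in instance_types:
        for zone in zones:
            try:
                instance_id = launch(instance_type, zone)  # hypothetical launcher
            except InsufficientInstanceCapacity:
                tried.append((instance_type, zone))  # "hot" here; move on
                continue
            return instance_id, instance_type, zone
    raise InsufficientInstanceCapacity(f"no capacity in any of {tried}")
```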
Why?

● Availability
● At our scale, an outage is too costly
○ Any DR adds complexity, not always worth it
The plan (v1): Active/Active

● Passive is expensive and doesn’t solve our scale issues


The plan (v1): Scope

● Focus on Mean Time To Recovery (MTTR)
○ Uptime isn’t free (it's not even cheap!)
○ Not every system requires 99.999% uptime
● Focus on where we can make the most impact
○ Include user-facing services
○ Exclude other services (e.g., data analytics) that can handle a brief service interruption


The plan (v1): Tech-Debt Cleanup

● Deprecations
○ Chef
○ Icinga
○ Graphite
○ Outdated OS versions
○ Etc.
The plan (v1): Newer systems

● Saltstack
● Kubernetes
● Prometheus
The plan (v1): Change plan

● Supporting multi-region requires changes


● Changes should be applied in current region
○ Minimizes “change” when bringing up new site
○ Avoids “drift” of system during planning and execution
The plan (v1): Deadline

● End of October (before November holiday shopping season)


Hurdle 1: Internal network

● Context/background
○ Very little cross-AZ traffic, but some
○ Need to support cross-region as well (e.g., notifications cluster)
● Solutions
○ Inter-region VPC peering for AWS <-> AWS
○ Backbone for Colo <-> Colo
Hurdle 2: Data consistency

● Main app DB -- Globally sharded DB
○ Single primary per shard, with automatic failover through election
■ As a read-heavy application, we can take a ~70ms hit on writes
○ Most reads can be “stale”
■ Split reads between primary and secondaries for performance/capacity

[Diagram: the AWS Region sends writes and primary reads to the primary (P) and secondary reads to the secondaries (S); the members are spread across on-premises DC1-DC4 and kept in sync by replication]
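
The read split can be illustrated with a toy shard model. The class and method names are hypothetical and the stores are in-memory dicts; a real deployment would rely on the database driver's read-preference settings instead.

```python
import random
from typing import Dict, List, Optional


class Shard:
    """One shard: a single primary plus secondaries that may lag behind it."""

    def __init__(self, secondary_count: int = 3) -> None:
        self.primary: Dict[str, str] = {}
        self.secondaries: List[Dict[str, str]] = [{} for _ in range(secondary_count)]

    def write(self, key: str, value: str) -> None:
        self.primary[key] = value  # all writes go to the primary

    def read(self, key: str, consistent: bool = False) -> Optional[str]:
        if consistent:
            return self.primary.get(key)  # pays the round trip to the primary
        # Stale-tolerant reads are spread over the secondaries for capacity.
        return random.choice(self.secondaries).get(key)

    def replicate(self) -> None:
        """Stand-in for asynchronous replication catching up."""
        for secondary in self.secondaries:
            secondary.update(self.primary)
```

Only callers that pass `consistent=True` pay the cross-site latency; everything else is served from a nearby secondary and may be slightly behind.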
Hurdle 2: Data consistency

● S3
○ Images, static content, etc.
○ Unidirectional cross-region replication

[Diagram: replication between two Regions in the AWS Cloud over the AWS backbone]
Hurdle 3: Traffic routing

● External
○ DNS: Don’t want different domain names for users
○ Route 53 geo-based balancing
● Internal
○ Migrated “the last few things” to service discovery (tech debt cleanup)

[Diagram: a user in the West is routed to Region West; a user in the East is routed to Region East]
An aside on system behavior

● Systems tend to do what they can, not what they should
○ “No hard-coded IPs”, but if it works, someone might do it
○ “Handle 100% traffic increase due to region failover”
● If you don’t want X to be done, don’t allow X or audit for X regularly
○ In k8s, hard-coded IPs don’t work
○ Regularly scheduled failovers force discipline
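
Auditing for X, where X is hard-coded IPs, could be as simple as the scanner sketched below. It is a hypothetical helper that only looks for literal IPv4 addresses in text such as config files; what to scan and how to report findings is left open.

```python
import ipaddress
import re
from typing import List, Tuple

# Candidate dotted quads; validated below so "1.2.3.999" is not reported.
_CANDIDATE = re.compile(r"(?<![\w.])(\d{1,3}(?:\.\d{1,3}){3})(?![\w.])")


def find_hard_coded_ips(text: str) -> List[Tuple[int, str]]:
    """Return (line number, address) for every literal IPv4 address in `text`."""
    findings: List[Tuple[int, str]] = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in _CANDIDATE.finditer(line):
            try:
                ipaddress.IPv4Address(match.group(1))
            except ValueError:
                continue  # looks like an IP but is not a valid one
            findings.append((line_number, match.group(1)))
    return findings
```

A four-part version string such as "1.2.3.4" would still be flagged, so an audit like this needs an allowlist or human review.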
Hurdle 4: Monitoring

● Still using Prometheus and CloudWatch
○ Internal metrics: 15s granularity
○ CloudWatch metrics: hit API rate limits when trying to ingest at that rate
● Tiered alerting (not all alerts require access to all the data)
○ Global
○ Regional
○ Local
Monitoring/Alerting Architecture

[Diagram: two Regions in the AWS Cloud, each with two Availability Zones. Local alerts sit at the Availability Zone level, a regional Promxy in each Region handles regional alerts, and a global Promxy in each Region handles global alerts]
The best laid plans of mice and men
often go awry
- Robert Burns
Adjustments

● Relaxed requirements for some deprecations (Chef)


● Some changes were canaried in the new region first
High-Level Architecture (after)

[Diagram: two Regions in the AWS Cloud, each with two Availability Zones, connected over the backbone to on-premises DC1-DC4. DC3 holds the primary (P); DC1, DC2, and DC4 hold secondaries (S). Paths are labelled “read” and “write & consistent read”]
Takeaways

● Understand your current architecture and why it is so (tech and business)
● Have a reliable mechanism to measure everything
● Clean up tech debt along the way (where practical)
● Set timeline and scope, but be ready to adapt
Takeaways
• For reliability and high availability, use a Multi-AZ architecture as a first step
• Multi-region active-active architectures help limit the blast radius in cases of major adverse events
• Because all regions are in constant use, they offer higher reliability than conventional disaster recovery architectures, at the expense of additional complexity
Takeaways
• Simplify
o Minimize blocking dependencies between Regions
o Minimize synchronous replication between Regions
o Degrade gracefully in case of connectivity issues
• AWS provides replication solutions for different data stores, such as DynamoDB Global Tables, RDS Cross-Region Read Replicas, S3 Cross-Region Replication, etc.
• Analytics stacks are typically deployed in one Region only.


Takeaways
• AWS managed services use the AWS backbone to replicate
• Use Inter-Region VPC peering to manage your own replication
• Think about traffic routing and manage it using Global Accelerator, Route 53, or CloudFront + Lambda@Edge
• Plan to manage the environment using tools like CloudFormation StackSets
Related breakouts
ARC309: Hands-on: Building a multi-region active-active solution
SVS337: Best practices for building multi-region, active-active serverless applications
ARC406: Building multi-region microservices
DAT308: Real case on boosting performance with Amazon ElastiCache for Redis
ARC304: From one to many: Diving deeper into evolving VPC design
NET202: Using AWS Global Accelerator for multi-region applications
Learn to architect with AWS Training and Certification
Resources created by the experts at AWS to propel your organization and career forward

• Free foundational to advanced digital courses cover AWS services and teach architecting best practices
• Classroom offerings, including Architecting on AWS, feature AWS expert instructors and hands-on labs
• Validate expertise with the AWS Certified Solutions Architect - Associate or AWS Certified Solutions Architect - Professional exams

Visit aws.amazon.com/training/path-architecting/

Thank you!
Girish Dilip Patil: linkedin.com/in/girish-cloud
Jonathan Dion: linkedin.com/in/jotdion, @jotdion
Thomas Jackson: [email protected], linkedin.com/in/jacksontj, github.com/jacksontj
