
Architecting Distributed Cloud Applications
Jeffrey Richter
Microsoft Azure Software Architect, Wintellect Co-Founder, & Author

Architecting Distributed Cloud Apps
 A 6.5hr technology-agnostic course
 YouTube: http://aka.ms/RichterCloudApps
 EdX: https://aka.ms/edx-devops200_9x-about

Contact: [email protected] | www.linkedin.com/in/JeffRichter | @JeffRichter
Course purpose
 Properly architecting distributed cloud apps requires a new mindset towards software development and introduces many new terms and patterns
 The purpose of this course is to delve into many of these terms, patterns & engineering trade-offs while being technology-agnostic
 Topics include: orchestrators, datacenters, containers, networking, messaging, versioning, configuration, storage services, and disaster recovery
Why cloud apps?
Feature      | Past                                | Present
Clients      | Enterprise/Intranet                 | Public/Internet
Demand       | Stable (small)                      | Dynamic (small → massive)
Datacenter   | Single tenant                       | Multi-tenant
Operations   | People (expensive)                  | Automation (cheap)
Scale        | Up via few reliable (expensive) PCs | Out via lots of (cheap) commodity PCs
Failure      | Unlikely but possible               | Very likely
Machine loss | Catastrophic                        | Normal (no big deal)

 We must do things differently when building cost-effective, failure-resilient solutions

Example       | Past                          | Present
Exceptions    | Catch, swallow & keep running | Crash & restart
Communication | In order, exactly once        | Out of order; clients must retry & servers must be idempotent
Cloud computing is all about embracing
failure
 Some reasons why a service instance may fail
(stop)
 Developer: Unhandled exception
 DevOps: Scaling the number of service instances down
 DevOps: Updating service code to a new version
 Orchestrator: Moving service code from one machine to another
 Force majeure: Hardware failure (power supply, fans [overheating],
hard disk, network controller, router, bad network cable, etc.)
 Force majeure: Data center outages (natural disasters, attacks)
 Since failure is inevitable & unavoidable,
embrace it
 Architect assuming failures will happen; think cattle, not pets
 Use an orchestrator that avoids single points of failure
Infrastructure/Platform/Containers/Functions
as a Service
aka Orchestrators
 Manage a cluster's (set of PCs/VMs) lifecycle, networking, health, upgrades, & scaling; deploy/run service code
 [Diagram: within a region, the orchestrator pulls service code from a repository and deploys/runs it on the PCs/VMs of the cluster's virtual network, fronted by a load balancer]
Regions, availability zones, & fault domains
 [Diagram: a region contains multiple availability zones (each with independent power & networking) connected by a private fiber-optic network; each AZ holds racks of PCs, each PC hosts VMs; your app's public endpoint fronts the region]
 A fault domain is a unit of failure
 Hierarchy: Planet / Region / Availability Zone / Rack / PC / VM
 Intra-service communication (replication): more fault tolerance = higher latency
Applications consist of many (micro)services
 [Diagram: an e-commerce application where a load balancer fronts Web Site instances #1–#3; the web site calls an Inventory Service (instances #1–#2 with their own data store) and an Orders Service (instances #1–#4 with their own data store)]
 Each service solves a domain-specific problem & has exclusive access to its own data store
4 reasons to split a monolith into microservices
 Scale independently (balance cost with speed): run many Photo Share Service instances alongside fewer Thumbnail Service instances, scaling each on its own
 Different technology stacks: ex: Photo Share Service on .NET, Thumbnail Service on node.js
 2+ clients (clients adopt new features at will): Photo Share Service (V1) and Video Share Service (V1) both call the Thumbnail Service (V1 & V2); backward compatibility must be maintained
 Conflicting dependencies: Photo Share Service uses SharedLib-v1 while Thumbnail Service uses SharedLib-v7
Microservice architecture benefits myths
 Myth: Microservices offer small,
easy-to-understand/manage code bases
 A monolith can use OOP & libraries (requires developer discipline)
 Library changes cause build failures (not runtime failures)
 Myth: A failing service doesn’t impact other
services
 Many services require dependencies be fully functioning
 Hard to write/test code that gracefully recovers when dependency
fails
 We run multiple service instances so there is no such thing as
“failure”
 A monolith is up/down completely; no recovery code
 Orchestrator restarts failed instances keeping them up
Composing SLAs for dependent services
 [Diagram: a request flows through a chain of services (Service → Service → … → Service); each hop's SLA compounds — see the sketch below]
 What about the network's SLA?

Each service's SLA | 1 service        | 2 services       | 3 services       | n services
99.99%             | 99.99% (260s/mo) | 99.98% (520s/mo) | 99.97% (780s/mo) | 99.99%^n ((n × 260s)/mo)
99.999%            | 99.999% (26s/mo) | 99.998% (52s/mo) | 99.997% (78s/mo) | 99.999%^n ((n × 26s)/mo)
Auto-scaling service instances
 [Diagram: left, clients post work to a queue consumed by Service-1…Service-4 (periodically check queue length); right, a load balancer spreads requests across Service-1…Service-3 (periodically check resource usage)]
 Periodically check queue length/resource usage (see the sketch below)
   If growing → scale up; if shrinking → scale down
 Scheduled (day/night, weekdays/weekends/holidays)
   You're predicting load based on what you expect
   Potentially dangerous, as actual load may be different than predicted
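 A minimal sketch of the queue-length flavor of this loop. The helpers (get_queue_length, add_instance, remove_instance, get_instance_count) and all thresholds are illustrative assumptions supplied by your queue service and orchestrator, not names from the course:

   import time

   SCALE_UP_DEPTH = 100      # queue depth above which we add an instance
   SCALE_DOWN_DEPTH = 10     # queue depth below which we remove an instance
   MIN_INSTANCES, MAX_INSTANCES = 2, 20
   CHECK_INTERVAL_SECONDS = 60

   def autoscale_loop(get_queue_length, add_instance, remove_instance, get_instance_count):
       """Periodically check queue length; if growing, scale up; if shrinking, scale down."""
       while True:
           depth = get_queue_length()
           count = get_instance_count()
           if depth > SCALE_UP_DEPTH and count < MAX_INSTANCES:
               add_instance()
           elif depth < SCALE_DOWN_DEPTH and count > MIN_INSTANCES:
               remove_instance()
           time.sleep(CHECK_INTERVAL_SECONDS)   # don't hammer the metrics API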
12-Factor Services (Apps)

http://12factor.net
12-factor services (1-5)
1. Single root repo; don’t share code with
another service
2. Deploy dependent libs with service
3. No config in code; read from environment
vars
4. Handle unresponsive service dependencies
robustly
5. Strictly separate build, release, & run steps
 Build: Builds a version of the code repo & gathers dependencies
 Release: Combines build with config → ReleaseId (immutable)
 Run: Runs service in execution environment
12-factor services (6-12)
6. Service is 1+ stateless processes & shares nothing
7. Service listens on ports; avoid using (web) hosts
8. Use processes for isolation; multiple for concurrency
9. Processes can crash/be killed quickly & start fast
10. Keep dev, staging, & prod environments similar
11. Treat logs as event streams
12. Run admin/management tasks as one-off processes
The 12 factors are all about…
 Services should be simple to build, test, &
deploy
 Services should be lightweight
 Few dependencies (OS/language/runtime/libraries), run fast, & use
less RAM
 Services should give reproducible results on
developer PC as well as test, staging, &
production clouds
Containers
Container images & containers
 A container image is immutable & defines a version of a single service with its dependencies (runtimes, etc.)
   Use the same container image everywhere: dev, test, staging, production
 A container runs an image in an isolated environment on a PC/VM
 Multiple containers (services) can run side-by-side within a single PC/VM
 [Diagram: one PC/VM hosting three containers: Svc-A:v1 (Lib-L v2, Runtime v5), Svc-B:v3 (Lib-L v3, Runtime v7), and Svc-A:v2 (Lib-L v3, Lib-M v2, Runtime v6)]
Isolation versus density
 Moving left to right (PC → VM → Hyper-V Container → Container → Process) trades isolation for density

                                    | PC         | VM         | Hyper-V Container | Container  | Process
Hardware                            | Not shared | Shared     | Shared            | Shared     | Shared
OS Kernel                           | Not shared | Not shared | Not shared        | Shared     | Shared
System Resources (ex: File System)  | Not shared | Not shared | Not shared        | Not shared | Shared
OS kernel & container images
 A container image must match the host's OS kernel (Linux/Windows)
 However, a Windows Hyper-V container can host a Windows or Linux container image
 [Diagram: left, a PC whose OS kernel (Linux/Windows) runs Container-1 and Container-2 directly plus Hyper-V Containers 3 and 4, each with its own kernel; right, a PC with a hypervisor (Xen/Hyper-V) running VMs, each VM's OS kernel hosting containers C-1…C-4]
Orchestrator starts containers on cluster's PCs/VMs
 [Diagram: the orchestrator acts as a Docker client, sending "docker run Svc-A:v1" to the Docker daemon (ports 2375 & 2376) on a PC/VM; the daemon pulls Svc-A:v1 from the container image registry (which also holds Svc-B:v3 and Svc-A:v2) into the local registry and starts the container]
 🛈 Orchestrator can restrict a container's RAM & CPU usage
CI: Continuous Integration
CD: Continuous Delivery & Continuous Deployment
 Continuous Integration (triggered by code check-ins to the source code repository):
   1. Checks out the code
   2. Builds it
   3. Creates a container image (pushed to the container image registry)
 [Diagram: Continuous Delivery promotes the image from the registry to the Test and Staging environments; Continuous Deployment promotes it to Production]
 Modern DevOps is all about automation; any failures…
Networking
Communication
8 fallacies of distributed computing
http://www.rgoarchitects.com/Files/fallacies.pdf

Fallacy                    | Effect
The network is reliable    | App needs error handling/retry
Latency is zero            | App should minimize # of requests
Bandwidth is infinite      | App should send small payloads
The network is secure      | App must secure its data/authenticate requests
Topology doesn't change    | Changes affect latency, bandwidth, & endpoints
There is one administrator | Changes affect ability to reach destination
Transport cost is zero     | Costs must be budgeted
The network is homogeneous | Affects reliability, latency, & bandwidth
Service endpoints
 Original design: IP:Port → PC:Service
   Designed to allow a client to talk to a specific service running on a specific PC
   On 1 IP, you can't have 2+ services listening on the same port at the same time
 Today: 1 PC hosts many VMs & 1 VM hosts many containers; each can run a service desiring the same port
 Virtualization (hacks) are required to make this work
   Routing tables, SNAT/DNAT, modification to client code, etc.
   We need something better, but too much legacy exists: network cards, …
Service scalability & high-availability
 Making things worse, we run multiple service
instances
 For service failure/recovery & scale up/down
 So, instances’ endpoints dynamically change over the service’s
lifetime
 Ideally, we’d like to abstract this from client code
 Each client wants a single stable endpoint as the face of the
dynamically-changing service instances’ endpoints
 Typically, this is accomplished via a reverse
proxy
 NOTE: Every request goes through the RP causing an extra network
hop
 We’re losing some performance to gain a lot of simplification
Forward & reverse proxies
 [Diagram: clients in the client infrastructure reach the Internet through a (forward) proxy; a reverse proxy in the server infrastructure fronts Server-1 and Server-2]
 A forward proxy processes outgoing requests:
   • Content filtering (ex: censoring, translation)
   • Caching
   • Logging, monitoring
   • Client anonymization
 A reverse proxy processes incoming requests:
   • Stable client endpoint over changing server instances' endpoints
   • Load balancing (Levels 4 [udp/tcp] & 7 [http]), server selection, A/B testing
   • SSL termination
   • Caching
   • Authentication/validation
   • Tenant throttling/billing
   • Some DDoS mitigation
Cluster DNS & service reverse proxy
 [Diagram: cluster DNS maps Inventory → the RP-I endpoint and Orders → the RP-O endpoint; a load balancer fronts Web Site #1–#3, which reach Inventory #1–#3 through RP-I and Orders #1–#2 through RP-O]
 ⚠ It's impossible to keep endpoints in sync as service instances come/go; client code must be robust against this
 ⚠ WS #1 could fail before Inventory #3 replies
Reverse proxy load balancer service probes
 Probe configuration examples: 1. Seconds=15, Port=80, Path=HealthProbe.aspx  2. Seconds=15, Port=8080, Path=HealthProbe.aspx
 [Diagram: the RP load balancer periodically probes Inventory #1–#3; instances answering HTTP 200 stay in rotation, while an instance answering 503 or not replying receives no traffic]
Turning a monolith into a microservice
 Requires an explicit, language-agnostic, multi-version API contract (loss of IntelliSense, refactoring & compile-time type-safety)
 var result = Method(arg1, arg2); becomes a (de)serialized network request
 In-process call → network request
   Performance: worse, increases network congestion, unpredictable timing
   Unreliable: requires retries, timeouts, & circuit breakers
     Server code must be idempotent
   Security: requires authentication, authorization, & encryption
     Required in a VNET for compliance or when running 3rd-party (untrusted) code
   Diagnostics: network issues, perf counters/events/logs, causality/call …
4 reasons to split a monolith into microservices
 Scale independently (balance cost with speed): run many Photo Share Service instances alongside fewer Thumbnail Service instances, scaling each on its own
 Different technology stacks: ex: Photo Share Service on .NET, Thumbnail Service on node.js
 2+ clients (clients adopt new features at will): Photo Share Service (V1) and Video Share Service (V1) both call the Thumbnail Service (V1 & V2); backward compatibility must be maintained
 Conflicting dependencies: Photo Share Service uses SharedLib-v1 while Thumbnail Service uses SharedLib-v7
API versioning
 Is an illusion; you must always be backward compatible
   You're really always adding new APIs & stating that the latest version is preferred
 The required "version" indicates which API to call
   http://api.contoso.com/v1.0/products/users
   http://api.contoso.com/products/users?api-version=1.0
   http://api.contoso.com/products/users?api-version=2016-12-07
 Add a new API when changing mandatory parameters, payload format, error codes (fault contract), or behavior
Defining network API contracts
 Define explicit, formal cross-language API/data contracts
   "Contracts" defined via code do not work; do not do this
   Ex: DateTime can be null in Java but not in .NET; not all languages support templates/generics, nullable types, etc.
   Consider https://www.openapis.org/ & http://swagger.io/
   Use tools to create language-specific client libraries
 Beware of (de)serialization RAM/CPU costs
 Use cross-language data transfer formats
   Ex: JSON/XML, Avro, Protocol Buffers, FlatBuffers, Thrift, Bond, etc.
   Consider embedding a version number in the data structure (see the sketch below)
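 A minimal sketch of embedding a version number in a JSON payload so readers can dispatch on it; the field names (schema_version, order_id, items) and the v1 layout are illustrative assumptions, not part of the course:

   import json

   def serialize_order_v2(order_id: str, items: list) -> str:
       # Stamp every payload with the schema version it was written with.
       return json.dumps({"schema_version": 2, "order_id": order_id, "items": items})

   def deserialize_order(payload: str) -> dict:
       doc = json.loads(payload)
       version = doc.get("schema_version", 1)   # hypothetical v1 payloads predate the field
       if version == 1:
           # Tolerant reader: upgrade the old shape to the current in-memory form.
           doc = {"schema_version": 2, "order_id": doc["id"], "items": doc.get("items", [])}
       return doc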
Beware leaky RPC-like abstractions
 To "simplify" programming, many technologies try to map method calls → network requests
   Examples: RPC, RMI, CORBA, DCOM, WCF, etc.
 These frequently don’t work well due to
 Network fallacies (lack of retries, timeouts, & circuit breakers)
 Chatty (method) versus chunky (network) conversations
 Language-specific data type conversions (ex: dates, times, durations)
 Versioning: Which version to call on the server?
 Authentication: How to handle expiring tokens?
 Logging: Log request parameters/headers/payload, reply
headers/payload?
 NOTE: Servers’ clocks are not absolutely synchronized
Clients must retry failed network
operations
 Client code must retry operations due to
 Network fallacies (timeout, topology changes [avoid sticky sessions])
 Server throttling
 Don’t immediately retry if service unavailable or on error reply
 Never assume a dependent service is already up & running
 To prevent DDoS attacking yourself
 Use exponential back-off & circuit breakers (see the sketch below)
   https://msdn.microsoft.com/en-us/library/dn589784.aspx
 Client retries assume server handles request
idempotently
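 A minimal retry sketch with exponential back-off and jitter, assuming a hypothetical call() that raises TransientError on timeouts/throttling; the exception type and limits are illustrative, not from the course:

   import random
   import time

   class TransientError(Exception):
       """Raised by call() for timeouts, throttling, or 5xx-style failures."""

   def call_with_retries(call, max_attempts=5, base_delay=0.5, max_delay=30.0):
       for attempt in range(1, max_attempts + 1):
           try:
               return call()
           except TransientError:
               if attempt == max_attempts:
                   raise                     # give up; let a circuit breaker/caller decide
               delay = min(max_delay, base_delay * (2 ** (attempt - 1)))
               time.sleep(delay * random.uniform(0.5, 1.5))   # jitter avoids retry storms

 Because the same request may reach the server more than once, the server must handle it idempotently (next slides).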
Services must implement operations idempotently
 An idempotent operation can be performed 2+ times with no ill effect
 Methods that input/process/output are idempotent
   Repeatedly creating a thumbnail of a specific photo produces the same result
 Methods with side-effects are not idempotent
   Repeatedly adding $100 to a specific account produces different results
 Retry + idempotency → exactly-once semantics
Idempotent CRUD considerations
Operation           | HTTP Verb              | What to do
C: id = Create()    | POST                   | See the pattern below
R: data = Read(id)  | GET/HEAD/OPTIONS/TRACE | Naturally idempotent
U: Update(id, data) | PUT                    | Last writer wins
D: Delete(id)       | DELETE                 | If already gone, OK
 HTTP requires most verbs (not POST) be idempotent
 Idempotency pattern (see the sketch below)
   1. Client: asks server to create a unique ID, or client (if trusted) creates an ID
   2. Client: sends ID & desired operation to server → may be retried
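 A minimal sketch of the create pattern: the client supplies a unique operation ID, and the server remembers which IDs it has already applied so a retried request isn't applied twice. The in-memory dictionaries stand in for the service's data store and are illustrative only:

   import uuid

   orders = {}            # order_id -> order record   (stand-in for the data store)
   processed_ops = {}     # operation_id -> order_id   (remembers already-applied creates)

   def client_create_order(items):
       op_id = str(uuid.uuid4())           # client-generated unique ID; safe to resend
       return create_order(op_id, items)   # this call may be retried with the same op_id

   def create_order(op_id, items):
       if op_id in processed_ops:                # retry of an operation we already did
           return processed_ops[op_id]           # return the original result; no new order
       order_id = str(uuid.uuid4())
       orders[order_id] = {"items": items}
       processed_ops[op_id] = order_id           # in a real store, record this atomically with the write
       return order_id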
Messaging
Communication
http://ReactiveManifesto.org/
Messaging communication
 The request/reply pattern is frequently not the
best
 Client may send to busy (not idle) service instance
 Client may crash/scale down while waiting for service instance’s reply
 So, consider messaging communication instead
 Resource efficient
 Client doesn’t wait for service reply (no blocked threads/long-lived
locks)
 Service instance pulls work vs busy service instances pushed more
work
 Services don’t need listening endpoints; clients/services talk to
queue service
 Resilient: client/service instances can come, go, & move at will
 If a service instance fails, another instance processes the message (1+ delivery, …)
Messaging with queues
 [Diagram: inside the cluster, a load balancer fronts WebSite #1–#3; the websites post work to queue Q-A, consumed by Service-A #1–#3; Service-A posts to Q-B, consumed by Service-B #1–#2; replies flow back through per-website queues Q-WS1–Q-WS3]
 🛈 Request/reply isn't required; Service-B #1 could post to Q-WS1, not to Q-A
 🛈 All Service-A & Service-B instances could go down and recovery is automatic when any come back up; but if WebSite #1 goes down, the originator must retry
Fault-tolerant message processing
 Get msg: DequeueCount++ & hides msg for n seconds
   If DequeueCount > threshold (2), log bad msg & delete it
   Else, after processing msg, delete msg from queue
 NOTE: Msgs can be processed 1+ times & out of order (see the sketch below)
 [Diagram: Clients 1–3 enqueue messages; Service-1 and Service-2 dequeue and process them concurrently]
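 A minimal sketch of this consumer loop against a hypothetical queue client exposing receive(visibility_timeout), msg.dequeue_count, msg.id, and delete(msg) — names chosen for illustration; real queue SDKs differ:

   POISON_THRESHOLD = 2              # matches the slide's example threshold
   VISIBILITY_TIMEOUT_SECONDS = 30

   def process_messages(queue, handle, log):
       while True:
           msg = queue.receive(visibility_timeout=VISIBILITY_TIMEOUT_SECONDS)
           if msg is None:
               continue                           # nothing to do right now
           if msg.dequeue_count > POISON_THRESHOLD:
               log("bad message", msg.id)         # it keeps failing; get it out of the way
               queue.delete(msg)
               continue
           handle(msg.body)                       # may run 1+ times & out of order -> must be idempotent
           queue.delete(msg)                      # only delete after successful processing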
Additional queue features
 A msg can be sent to multiple “subscribers”
 Allows single msg to be broadcast & processed in parallel
 Ex: Chat msgs, weather/stock/news updates
 Message TTL
 Prevents costs from skyrocketing should consumers never come
online or take too long to process messages
 Consumer-specified invisibility timeout
 Short: Service failure lets another service process the msg right away
 Long: Prevents msg from being processed multiple times
 Service can periodically update timeout if msg actively being
processed
 Service can periodically update msg content enabling efficient
continuation on failure
At-most-once message processing
 Good for time-sensitive data that expires/gets
replaced
 Ex: stock prices, temperature, sports scores, etc.
 Pattern
 Client places msg in queue with maximum TTL
 Service gets msg setting invisibility timeout > maximum TTL
 If consumer crashes, msg expires before becoming visible again
 Result: Msg is processed 0 or 1 times
Versioning Service Code
Service update options
 Delete & Upload: take down the cluster's V1 instances, then upload & start V2
 Rolling Update: upgrade the cluster's instances from V1 to V2 a few at a time; V1 & V2 run side by side during the update
 Blue-Green Deployment (or across 2 clusters): stand up a full V2 deployment next to V1, then do a controlled reverse-proxy migration (or VIP swap)
Comparing service update options
Feature                | Delete & Upload              | Rolling Update               | Blue-Green Deployment
Add'l hardware costs   | None                         | None                         | 1x to 2x
Service availability   | Downtime                     | Reduced scale                | Same
Failed update recovery | Downtime until V1 redeployed | Reduced scale until rollback | Immediate after swap back
V2 testability         | Not with V1                  | Not with V1                  | With V1
Protocol/Schema change | 1-Phase                      | 2-Phase                      | 1-Phase
 Of course, you can perform some updates one way & other updates a different way
Rolling update: how to version APIs
 All API requests must pass version info, starting with v1
 New service versions must be backward compatible
 What about intra-service instance requests?
   During a rolling update, old & new service instances run together
   Failure occurs if a v2 instance makes a v2 API request to a v1 service instance
   Fix by performing a 2-phase update
     1. Deploy v2 service instances (which accept v2 & v1 API requests) but never send v2 API requests
     2. After all instances are v2, reconfigure instances to send v2 API requests
Gracefully shutting down a service instance
 12-factor services are stopped via SIGTERM or Ctrl-C
   Your service code should intercept this, and then…
 Drain inflight requests before stopping the process (see the sketch below)
   Use an integer representing requests inflight; initialize it to 1 (not 0)
   As requests start/complete, increment/decrement the integer
   To stop, answer all future LB probes with "not ready" so the LB stops sending traffic
   When you're sure the LB has stopped sending traffic (~30 seconds), decrement the integer
   When the integer hits 0, the service process can safely terminate
   NOTE: Don't let long-running inflight requests prevent process termination
Service Configuration
& Secrets
Service (re)configuration
 Use config for info that shouldn't be in source code
   Account names, secrets (passwords/certificates), DB connection strings, etc.
   Use Cryptographic Message Syntax (CMS) to avoid clear-text secrets
 12-factor services pass config via environment variables (see the sketch below)
   Change config: stop the process & restart it with new environment variable values
 When using a rolling upgrade to reconfigure, roll back if the new config causes service instance(s) to fail
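 A minimal sketch of reading configuration from environment variables at startup; the variable names (SVC_DB_CONNECTION, SVC_LISTEN_PORT, SVC_SEND_V2_API) are made up for illustration:

   import os

   class Config:
       """Everything the service needs that isn't code comes from the environment."""
       def __init__(self):
           self.db_connection = os.environ["SVC_DB_CONNECTION"]          # fail fast if missing
           self.listen_port = int(os.environ.get("SVC_LISTEN_PORT", "8080"))
           self.send_v2_api = os.environ.get("SVC_SEND_V2_API", "false") == "true"

   config = Config()   # reconfiguring = restarting the process with new env var values

 A flag like send_v2_api is also how the second phase of a 2-phase rolling update can be switched on without redeploying code.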
Cryptographic Message Syntax (CMS)
 Use CMS to avoid cleartext secrets
 CMS encrypts/decrypts messages via RFC-3852
 Secret producer
 Encrypts cleartext secret for a recipient using a certificate and
embeds the certificate’s thumbprint in cyphertext
 Set the desired setting’s value to the cyphertext
 Secret consumer
 Get the desired setting’s cyphertext value
 Decrypts cyphertext producing the cleartext secret for use in the
service code
 Decryption automatically uses the certificate referenced by the
embedded thumbprint; the certificate must be available
Leader Election
Leader election
 Picks 1 service instance to coordinate tasks
among others
 Leader can “own” some procedure or access to some resource
 At a certain time, chose 1 instance to do billing, report generation,
etc.
 Commonly used to ensure data consistency
 Aggregates results from multiple instances together
 Conserves resources by reducing chance of work being done by
multiple
service instances
 Problem: If the leader dies, elect a new leader (quickly)
   These algorithms are hard to implement due to race conditions & …
Leader election via a lease
 All service instances execute (see the runnable sketch below):

   while (!AskDB_IsProcessingDone()) {
      bool isLeader = RequestLease()
      if (isLeader) {
         ProcessAndRenewLease()   // NOTE: may crash; lease abandoned
         TellDB_ProcessingIsDone()
      } else { /* Continuously try to become the new leader */ }
      Delay()   // Avoid DB DDoS
   }

 [Diagram: Services #1–#3 compete for a lease row in the database (Leasee, Lease expiration time, Work done); the holder of an unexpired lease is the leader; once the lease expires, another instance (ex: #3) can take it]
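 A minimal runnable sketch of the lease loop using Python's sqlite3 as a stand-in for the shared database; a real deployment would use a shared store with an atomic conditional update, and the table/column names here are invented for the example:

   import sqlite3
   import time
   import uuid

   LEASE_SECONDS = 15
   INSTANCE_ID = str(uuid.uuid4())

   db = sqlite3.connect("lease.db", isolation_level=None)   # stands in for a shared database
   db.execute("CREATE TABLE IF NOT EXISTS lease (id INTEGER PRIMARY KEY CHECK (id = 1), "
              "holder TEXT, expires REAL, done INTEGER DEFAULT 0)")
   db.execute("INSERT OR IGNORE INTO lease (id, holder, expires, done) VALUES (1, NULL, 0, 0)")

   def request_lease() -> bool:
       """Atomically take (or renew) the lease if it is free, expired, or already ours."""
       now = time.time()
       cur = db.execute(
           "UPDATE lease SET holder = ?, expires = ? "
           "WHERE id = 1 AND (holder IS NULL OR holder = ? OR expires < ?)",
           (INSTANCE_ID, now + LEASE_SECONDS, INSTANCE_ID, now))
       return cur.rowcount == 1        # we are the leader iff the conditional update applied

   def processing_done() -> bool:
       return db.execute("SELECT done FROM lease WHERE id = 1").fetchone()[0] == 1

   def leader_election_loop(process_and_renew_lease):
       """process_and_renew_lease() does the work, calling request_lease() periodically to renew."""
       while not processing_done():
           if request_lease():
               process_and_renew_lease()   # may crash; the lease simply expires and another instance takes over
               db.execute("UPDATE lease SET done = 1 WHERE id = 1 AND holder = ?", (INSTANCE_ID,))
           time.sleep(LEASE_SECONDS / 3)   # delay to avoid hammering the DB while waiting to become leader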
Leader election via queue message
 At a certain time, insert 1 msg into a queue
 All service instances execute:

   while (true) {
      Msg msg = TryDequeueMsg()
      if (msg != null) {
         /* This instance is the leader */
         ProcessMsg()   // NOTE: may crash; msg becomes visible again
         DeleteMsg(msg)
      } else { /* Continuously try to become the new leader */ }
      Delay()   // Avoid queue DDoS
   }
Data Storage Services
Data storage service considerations
 Building reliable & scalable services that
manage state is substantially harder than
building stateless services
 Due to data size/speed, partitioning, replication, leader election,
consistency, security, disaster recovery, backup/restore, costs,
administration, etc.
 So, use a robust/hardened storage service
instead
 When selecting a storage service, fully understand your service’s
requirements & the trade-offs when comparing available services
 It is common to use multiple storage services from within a single
service
Data temperature
             | Hot                 | Warm                       | Cold
Storage      | RAM, local SSD/disk | Network service & SSD/disk | Network service & tape
Latency      | ms                  | ms-sec                     | min-hour
Cost/GB      | $$-$                | $-¢¢                       | ¢
Request rate | Very high           | High                       | Low
Durability   | Low-high            | High                       | Very high
Max size     | MB-GB               | GB-TB                      | PB-EB
Item size    | B-KB                | KB-MB                      | KB-TB
A cache can improve performance but introduces stale (inconsistent) data
 [Diagram: a load balancer fronts stateless web & stateless compute tiers; they consult a cache before going to the storage service or other internal tiers]
Object Storage Services
Object storage services
 The most frequently-used storage service
   Used for documents, images, audio, video, etc.
   Fast & inexpensive: GB/month storage, I/O requests, & egress bytes
 All cloud providers offer an object storage service
   Minimal lock-in: it's relatively easy to move objects across providers if you avoid provider-specific features
 Object storage services offer public (read-only) access
   Give object URLs to clients; the URL goes to the storage service, reducing load on your other services!
How a CDN works
 [Diagram: the origin server (an object storage service in a US West datacenter) holds the object; CDN PoPs around the world cache copies of it and serve nearby clients #1 and #2]
Database Storage
Services
DB storage services
 Store many small, related entities
   Features: query, joins, indexing, sorting, stored procs, viewers/editors, etc.
 Rel-DBs (SQL) require expensive PCs for better size/perf
   For data relationships: a customer → orders
   Supports sorts, joins, & ACID updates
 NonRel-DBs (NoSQL) spread data across many cheap PCs
   For customer preferences, shopping carts, product catalogs, session state, etc.
   Cheaper, faster, bigger & flexible data models (entity ≈ in-memory object)
Relational DB vs non-relational DB
 [Diagram: left, Services #1–#5 share one relational database (a single partition) supporting complex CRUD, joins, sorts, stored procs, & cross-table transactions; right, Services #1–#5 use a non-relational database spread across Partitions #1–#3 supporting simple CRUD, with joins, sorts, etc. done elsewhere]
Data partitioning & replicas
 Data is partitioned for size, speed, or both
   Architecting a service's partitions is often the hardest part of designing a service
   Cross-partition ops require network hops & different/distributed transactions
   How many partitions depends on how much data you'll have in the future, and how you intend to access that data
 Each partition's data is replicated for reliability
   Replicating state increases the chance of data surviving 1+ simultaneous failures
   But more replicas increase cost & the network latency to sync replicas
   For some scenarios, data loss is OK
Replication: No failure scenario (consistency & availability)
 [Diagram: a load balancer fronts a database with three replica stores; with no failure, every replica holds the same records (AAA, BBB), so the DB is both consistent and available]
Data Consistency
Data consistency
 Strong: 2+ records form the relationship at the same time
   ACID transactions: Atomicity, Consistency, Isolation, Durability
   Goal: it looks like 1 thing at a time is happening, even if the work is complex
   Done via distributed txs/locks across stores; hurts perf & not fault tolerant
 Weak: 2+ records form the relationship eventually
   BASE transactions: Basically Available, Soft state, Eventual consistency
   Done via communication retries & idempotency across stores
 CAP theorem states
   When facing a network Partition (stores can't talk to each other), you must choose between Consistency & Availability
Replication: Network partition (failure)
 [Diagram: a network partition separates the database's replica stores; one side has applied a change (AAA) while the other still holds the old value (BBB)]
 Consistency: if enough stores don't ack the change, the DB won't respond, to avoid returning inconsistent data; a new store may come up
 Availability: stores don't have to ack the change; the DB may respond with inconsistent data (AAA or BBB)
Consistency or availability: which is
better?
 Businesses love the service responding to
customers
 Developers love trusting data; but do you
really get this?
 No distributed tx across 2+ services' DBs
   Ex: You can't atomically transfer an item from the Inventory service → Order service
 Web page/cache data gets out of sync with back-end truth
 CQRS: writes are asynchronous; reads are synchronous
 Apology-based computing
 If software models the real world, then the real world is the truth
 Physical example: Item physically destroyed during shipping
CQRS: Command Query Responsibility Segregation
 Decouples the command & query data models
   Each view can be complex (with relations) & (re)built in the background
 [Diagram: the user interface sends commands to the service's command processor, which writes to the command data store (tables) and returns an ACK; data flows with eventual consistency into the query data store (views), which the query processor reads to answer queries]
Event Sourcing
 Commonly used with CQRS & Big Data
 Save events in append-only & immutable tables (see the sketch below)
 [Example: the event source table for Jeffrey's account holds +$0.00 "New account" (2017-04-01T09:00:05), +$100.00 "Paycheck" (2017-04-16T08:28:36), and -$52.35 "Restaurant" (2017-04-17T01:05:22); the snapshot view for all accounts shows Jeffrey at +$47.65 as of 2017-04-17T01:05:22]
 Pros
   • When reading, no event locking (good perf)
   • Write bugs are unlikely & can't corrupt immutable data
   • Easy to (re)build today, historical, or audit views
 Cons
   • Boundless storage (but it's cheap)
   • Replaying data is time-consuming (improve with periodic snapshots)
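 A minimal sketch of the idea: append immutable events and fold them to (re)build a snapshot view; it reproduces the $47.65 balance from the example above (the tuple layout is illustrative):

   from datetime import datetime

   # Append-only event stream for one account: (timestamp, amount, memo)
   events = [
       (datetime(2017, 4, 1, 9, 0, 5),     0.00, "New account"),
       (datetime(2017, 4, 16, 8, 28, 36), 100.00, "Paycheck"),
       (datetime(2017, 4, 17, 1, 5, 22),  -52.35, "Restaurant"),
   ]

   def append(event):
       events.append(event)      # events are only ever appended, never updated or deleted

   def snapshot(as_of=None):
       """Rebuild the account balance by replaying events up to a point in time."""
       return sum(amount for ts, amount, _ in events if as_of is None or ts <= as_of)

   print(snapshot())                                    # 47.65 -- today's view
   print(snapshot(datetime(2017, 4, 16, 23, 59, 59)))   # 100.00 -- a historical view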
Implementing eventual consistency
 A client can determine when data is consistent
 Client reads entities A & B; If A references B but B doesn’t reference A
(yet),
client assumes the relationship doesn’t exist (yet)
 Use fault-tolerant message queues & the Saga
pattern
to guarantee that all operations eventually
complete
 Saga pattern compromises atomicity in order to give greater
availability
Saga pattern
 The Saga Execution Controller (SEC) attempts txs in concurrent or risk-centric order
   For fault tolerance, operations may be retried & must be idempotent
 2 recovery modes (see the sketch below)
   Backwards: If a tx fails, undo all successful txs
   Forwards: Retry every tx until all are successful
 [Diagram: Trip Sagas are queued to the Saga Execution Controller, which calls the Rental Car, Hotel, and Airplane Reservation Services]
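 A minimal sketch of the backwards-recovery mode: each step pairs an action with a compensating action, and if any step fails the saga undoes the completed ones in reverse order. The reservation/cancellation functions named in the comment are placeholders, not a real API:

   def run_saga(steps):
       """steps: list of (action, compensation) callables. Backwards recovery on failure."""
       completed = []
       for action, compensation in steps:
           try:
               action()                          # may be retried internally; must be idempotent
               completed.append(compensation)
           except Exception:
               for undo in reversed(completed):  # undo all successful txs, newest first
                   undo()
               raise

   # Illustrative trip-booking saga (reserve_* / cancel_* are stand-ins):
   # run_saga([(reserve_car, cancel_car), (reserve_hotel, cancel_hotel), (reserve_flight, cancel_flight)])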
Concurrency, Versioning,
& Backup
Concurrency control
 A data entity maintains its integrity for
competing accessors via concurrency control
 Pessimistic: accessor locks 1+ entries (blocking
other accessors), modifies entries, & then
unlocks them
 Bad scalability (1 accessor at a time) & what if locker fails to release
the lock?
 Optimistic: accessor gets 1+ entries with
version IDs (etags), modifies entries’ values, &
updates entries if versions haven’t changed
(still contain the original values)
Optimistic concurrency: two instances adding $23 & $32 to Jeff's balance
 [Diagram: both services read Jeff's account (Version 0001, Balance $100) from the DB partition. Service #1 writes $123 conditioned on Version 0001; it succeeds and the version becomes 0002. Service #2's write of $132 conditioned on Version 0001 fails, so it re-reads (Version 0002, $123), recomputes, and writes $155, producing Version 0003. See the sketch below.]
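 A minimal sketch of that read-modify-write loop against a hypothetical store exposing get(key) -> (value, version) and put_if_version(key, value, expected_version) -> bool (an etag-style conditional update; the method names are invented for illustration):

   def add_to_balance(store, account_key, amount, max_attempts=10):
       """Optimistic concurrency: retry the conditional write until no one races us."""
       for _ in range(max_attempts):
           balance, version = store.get(account_key)        # read the value + its version (etag)
           new_balance = balance + amount                    # compute off the value we read
           if store.put_if_version(account_key, new_balance, version):
               return new_balance                            # nobody changed it since we read it
           # else: another writer won (the version moved on); re-read and try again
       raise RuntimeError("too much contention on " + account_key)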
Versioning data schemas
 Use formal, language-agnostic data schemas
 The data is the truth; not the language data type
 All data must specify version info starting with
v1
 New services must be infinitely backward
compatible
 Service v1 might create an entry that isn’t accessed for years
 During rolling update, v1 & v2 instances run
together
 Failure occurs if v2 instance writes v2 data schema & v1 instance
reads it
 Fix by performing a 2-phase update
Backup & restore
 Needed to recover from corrupt data due to
coding bug or a hacker attack
 Detecting data corruption is a hard domain-specific problem
 You periodically backup data in order to restore
it to a known good state
 NOTE: Restore usually incurs some downtime & data loss
 Many DBs don’t support making a consistent backup across all
partitions
 Try hard to avoid cross-partition relationships
 Incremental backups are faster than a full backup but hurt restore
performance
 Make sure you test restore
Recovery point & time objectives
 Recovery Point Objective (RPO)
   Maximum data (in minutes) the business can afford to lose
 Recovery Time Objective (RTO)
   Maximum downtime the business can afford while restoring data
 Data loss & recovery cannot be completely prevented
   The earth is a single point of failure
   Deciding RPO & RTO is mostly a business decision
   NOTE: The smaller the RPO/RTO, the more expensive it is to run the service
Disaster Recovery (DR)
Disaster recovery
 Dealing with a datacenter outage
   Code: Easy, upload it to another DC
   Data: Hard, data must be replicated across DCs
   Latency: ~133ms round trip for ½ way around the earth (best case)
 Create similar clusters in different geographical regions
 When data changes in a cluster, replicate it to the other cluster
   Usually batch changes & replicate periodically
   The delay is the RPO, as the first cluster could die before sending the next batch
   The more clusters, the more …
Active/passive architecture
 Datacenter-A (active) takes traffic & periodically replicates data changes to Datacenter-B (passive)
   DC-A handles all traffic spikes
   DC-B has wasted capacity
   Code development is easy
   Failover is infrequently tested
   Admin decides when to fail over & manually initiates it
 [Diagram: traffic flows to Datacenter-A (Active), which replicates to Datacenter-B (Passive)]
Active/active architecture
 Datacenters A & B both take traffic & periodically replicate data changes to the other DC
   Both DCs handle spikes
   Less expensive & less wasted capacity
   Continuously tested
   Development is harder: data inconsistency or dual reads
   Failover is fast & automatic
 [Diagram: traffic flows to both Datacenter-A (Active) and Datacenter-B (Active), which replicate to each other]