Network Design: Architecting With Google Cloud Platform: Design and Process
Agenda
Network configuration for data transfer within the service
© 2018 Google Inc. All rights reserved. Google and the Google logo
are trademarks of Google Inc. All other company and product names may
be trademarks of the respective companies with which they are associated.
Load Balancing
Caching
No more than 6-7 round trips between Europe and the US per second are possible,
but approximately 2000 per second can be achieved within a datacenter.
Location is significant. Only in this case, the farther away something is, the more
you pay for it.
Note: Describes VM-to-VM communications inside the Google Network.
You can use performance testing tools such as iperf to test timing.
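The round-trip figures quoted above can be turned into approximate latencies. This is a minimal sketch that assumes the round trips are strictly sequential (each one waits for the previous one to finish); the function name and the 6.5 midpoint are illustrative choices, not values from the course.

```python
def implied_round_trip_ms(round_trips_per_second: float) -> float:
    """Round-trip time implied by a rate of strictly sequential round trips."""
    return 1000.0 / round_trips_per_second


# 6.5 is the midpoint of the quoted "6-7 round trips per second" (an assumption).
europe_us_ms = implied_round_trip_ms(6.5)
in_datacenter_ms = implied_round_trip_ms(2000)

print(f"Europe <-> US: about {europe_us_ms:.0f} ms per round trip")
print(f"Within a datacenter: about {in_datacenter_ms:.1f} ms per round trip")
```

Under that assumption the two rates correspond to roughly 150 ms versus 0.5 ms per round trip, which is why chatty protocols suffer most across regions.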
Load Balancing
Network speed is just one factor in throughput. Network location is key. Parallelism is
another factor. And load balancing combines both.
GLOBAL: HTTP(S), SSL Proxy, TCP Proxy
REGIONAL: Network, Internal
In general, for traffic originating externally, stick to the protocol-named service that is
designed for and optimized for that protocol, unless you have a compelling reason.
For multi-tier internal traffic, use the internal load balancing service.
Then use the more general network load balancing service for anything else.
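The guidance above can be written down as a small decision function. This is a simplified sketch, not an official selection procedure: the function name, its parameters, and the protocol strings are hypothetical, and real choices also depend on factors such as ports and regional requirements.

```python
def choose_load_balancer(external: bool, protocol: str,
                         multi_tier_internal: bool = False) -> str:
    """Pick a load balancing service following the rule of thumb in the notes."""
    protocol = protocol.upper()
    if external:
        # External traffic: prefer the service named after the protocol.
        if protocol in ("HTTP", "HTTPS"):
            return "HTTP(S)"
        if protocol == "SSL":
            return "SSL Proxy"
        if protocol == "TCP":
            return "TCP Proxy"
        # Anything else (for example UDP) falls through to the general service.
        return "Network"
    if multi_tier_internal:
        return "Internal"
    return "Network"


print(choose_load_balancer(external=True, protocol="https"))
print(choose_load_balancer(external=False, protocol="tcp", multi_tier_internal=True))
```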
Figure: decision flowchart for choosing a load balancer. The possible outcomes are
Internal, Network, HTTP(S), SSL Proxy, and TCP Proxy; one decision point asks whether
pass-through is required.
HTTP(S) load balancing: You can configure URL rules that route some URLs to one
set of instances and route other URLs to other instances. Requests are always routed
to the instance group that has capacity and is closest to the user.
https://cloud.google.com/compute/docs/load-balancing/http/
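The URL rules described above can be pictured as a prefix table. The sketch below is only a conceptual model of path-based routing; the rule table, the instance group names, and the longest-prefix policy are assumptions made for illustration and are not the load balancer's actual configuration format.

```python
# Hypothetical rules: path prefix -> instance group name.
URL_RULES = {
    "/video/": "video-instances",
    "/static/": "static-instances",
}
DEFAULT_GROUP = "web-instances"


def route(path: str) -> str:
    """Return the instance group for a request path (longest matching prefix wins)."""
    matches = [prefix for prefix in URL_RULES if path.startswith(prefix)]
    if not matches:
        return DEFAULT_GROUP
    return URL_RULES[max(matches, key=len)]


print(route("/video/cats.mp4"))
print(route("/index.html"))
```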
SSL proxy: A proxied global load balancing service that automatically directs SSL
traffic to the closest region that has capacity.
https://cloud.google.com/compute/docs/load-balancing/tcp-ssl/
TCP proxy: Terminates IPv4 and IPv6 client connections and initiates an IPv4 connection
to the backend servers.
https://cloud.google.com/compute/docs/load-balancing/tcp-ssl/tcp-proxy
Network load balancing allows you to balance load of your systems based on
incoming IP protocol data, such as address, port, and protocol type.
https://cloud.google.com/compute/docs/load-balancing/network/
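Balancing on address, port, and protocol can be sketched as hashing those fields and using the result to pick a backend, so that packets of one connection keep landing on the same instance. This is a conceptual illustration only: the SHA-256 hash, the function name, and the addresses are assumptions, and the service's real hashing scheme is not described here.

```python
import hashlib


def pick_backend(src_ip: str, src_port: int, dst_ip: str, dst_port: int,
                 protocol: str, backends: list[str]) -> str:
    """Map a connection's address/port/protocol fields to one backend."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}/{protocol}".encode()
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as an integer and reduce it
    # to an index into the backend list.
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


backends = ["vm-a", "vm-b", "vm-c"]  # hypothetical instance names
first = pick_backend("203.0.113.7", 50123, "198.51.100.10", 443, "TCP", backends)
again = pick_backend("203.0.113.7", 50123, "198.51.100.10", 443, "TCP", backends)
print(first, again)
```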
Internal load balancing enables you to run and scale your services behind a private
load balancing IP address which is accessible only to instances internal to your Virtual
Private Cloud (VPC).
https://cloud.google.com/compute/docs/load-balancing/internal/
Network load balancing was used for internal load balancing before the internal load
balancing service was available. Configuration is significantly more complicated with
network load balancing because you have to restrict access to the VPC using firewall
rules and routes. You must also plan for the capacity of the load balancer itself,
because choke points are possible and the load balancer could reach capacity and
impact availability. If there is some reason internal load balancing won't work in your
situation, network load balancing is still an alternative. However, there are no
common use cases.
Multi-cloud
To get a global static IP address for a GCP resource, configure global load balancing.
Cloud CDN
In certain circumstances caching can be a design issue. For example, if a value your
application relies on is cached and you want to roll out a new version of the
application that changes the value, the cached value could create issues that are
difficult to troubleshoot.
If you decide to use a third-party or open source cache as part of your solution,
investigate how it handles cache management.
Figure: Development and Production environments connected to another cloud. Labels:
Shared Virtual Private Cloud, Cloud Router, Dedicated Interconnect, Cloud Partner
Interconnect, Another Cloud, Direct Connect, Router.
https://cloud.google.com/interconnect/docs
VPN configurations
Reliability configuration
● Two VPN gateways connect to the same peer IP.
● Traffic is load balanced between the two VPN gateways.
● If one path is lost, the other takes over.
Aggregate capacity configuration
● Forward the same IP range to two peer gateways.
● Traffic is load balanced over the tunnels, combining the capacity.
● Max: 3 Gbps per tunnel over direct interconnect, 1.5 Gbps over internet.
Cloud Router
● Adds BGP dynamic discovery of routes.
Figure: three diagrams, each showing VPN gateways connected to the gateways of a peer
network; the third adds a Cloud Router.
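Because the aggregate capacity configuration combines tunnels, the per-tunnel maximums quoted on the slide give a rough lower bound on how many tunnels a target bandwidth needs. The sketch below does only that arithmetic; the function and dictionary names are hypothetical, and real throughput is typically below the per-tunnel maximum, so treat the result as a starting point.

```python
import math

# Per-tunnel maximums quoted on the slide, in Gbps.
PER_TUNNEL_GBPS = {"interconnect": 3.0, "internet": 1.5}


def tunnels_needed(target_gbps: float, path: str) -> int:
    """Smallest number of tunnels whose combined maximum reaches the target."""
    return math.ceil(target_gbps / PER_TUNNEL_GBPS[path])


print(tunnels_needed(6, "interconnect"))
print(tunnels_needed(6, "internet"))
```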
https://cloud.google.com/compute/docs/vpn/advanced
VPN Performance
Verify that the capacity of the peer devices matches the VPN gateways
There are many settings, including MTU, which is normally dynamically set
If you are measuring throughput over VPN, use multiple TCP streams
● iperf -P
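As a sketch of the multi-stream measurement, the helper below only builds an iperf client command line with parallel streams; it does not run it. The `-c`, `-P`, and `-t` flags are standard iperf client options, while the function name, the server address, and the default stream count and duration are hypothetical.

```python
import shlex


def iperf_client_command(server: str, parallel_streams: int = 4,
                         seconds: int = 30) -> list[str]:
    """Build an iperf client command that uses several parallel TCP streams."""
    if parallel_streams < 1:
        raise ValueError("parallel_streams must be at least 1")
    return ["iperf", "-c", server, "-P", str(parallel_streams), "-t", str(seconds)]


# 10.0.0.2 stands in for an iperf server on the far side of the VPN.
command = iperf_client_command("10.0.0.2", parallel_streams=8)
print(shlex.join(command))
```

Passing the list to `subprocess.run` would execute the test, provided an iperf server is already listening at that address.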
https://cloud.google.com/compute/docs/vpn/advanced#recommended_measures_to_increase_vpn_throughput
Periodic slowdown
Okay, so let's go back to our photo service. In this case, we have a periodic
slowdown, which means that under certain conditions the service is very slow, but at
other times it's fast. So, what could be causing this irregularity? What's causing the
service itself to slow down? And, what can we do to fix it?
The system is slow. It is taking minutes to generate thumbnails
However, during peak periods there appears to be a slowdown and it can take up
to several minutes after submitting a photo for the thumbnail image to be
returned.
The Web Dev team thinks the problem is in the thumbnail application code.
The App Dev team thinks the problem is in the web server application code.
So, the thumbnail service is growing and this can be seen through the number of
thumbnails being generated, which is great. We're starting to get more popularity, but
monitoring our log processing shows that there appears to be a slowdown during
peak periods, and it can take up to several minutes after submitting a photo for the
thumbnail image to be returned.
Now, this doesn’t really happen, but let’s take a fictitious scenario where a company
has groups of teams that don't really get along, or will blame each other. So in this
case, the Web Dev team thinks that the problem is the thumbnail application code.
Well, guess what? The people who wrote the code - the App Dev team - think the
problem is in the web server application code, because it might not be handling
sessions and so on.
But there are other teams that were impacted too. The Support team is dealing with
calls from users who are asking for help. The Operations team, who manage the
deployment and the production servers, doesn't have a procedure to fix the problem.
They're not sure if it's the Web team's fault or the App team's fault. So, who's going
to fix this?
PROCESS
Find and fix the real problems
The root cause is always:
● Systems
● Processes
● Behaviors
BLAMELESS CULTURE
Google learned that the reliability of the service depends on how people work
together to fix problems. Every new system or upgraded system goes through a
period of stabilization. During that period you will need to respond to problems, find
the root causes, and address the problems. If you stop looking after you have
assigned blame to a person, and don't continue digging until you get to the systems,
processes, or behaviors that must be changed, you will leave the system broken,
and it will not stabilize.
If the analysis had stopped at 2, the person might have been punished without solving
the core system problem, which was the absence of procedures.
The service doesn't stabilize if you don't find and fix the real problems.
Learning together
Outages happen. And what may be clear to one person may not be clear to another.
Some outcomes would not have been anticipated by anyone.
Blameless
Blame makes people afraid to bring real issues to light and is detrimental to a learning
culture.
People are NEVER the root cause. There is something in the system, in the
processes, or in the behaviors that IS the root cause and needs to be identified and
fixed or mitigated.
PROCESS
Policy for writing postmortem reports
It is important that teams learn from mistakes. Each postmortem written and read
reduces the chances of repeating mistakes. Postmortem reports become a method
for training people.
Refresher
Figure: photo service architecture. Components: Upload Server, Thumbnail Server, Data
Storage Service. Stages: Ingest, Image Storage, Thumbnail Conversion (Processing),
Thumbnail Storage, Serving Thumbnails (User Experience).
The system is slow. It is taking minutes to generate thumbnails
After systematic and logical troubleshooting, and answering the "five whys", the
team determines that the issue is definitely tied to the capacity of the system to
generate thumbnails.
The front-end web service is not causing delays. The delay comes only from the back-end
thumbnail generating service, which is failing to keep up with demand.
CPU utilization is non-linear. During busy times, the utilization goes to 100%, which
impacts the end-to-end response time for the user.
Keep in mind that CPU utilization is not to be used as a service level indicator. It is not
a direct measurement of customer pain.
Scale the backend processing of thumbnails
Figure: the same architecture with multiple Thumbnail Servers. Components: Thumbnail
Servers, Data Storage Server. Stages: Ingest, Image Storage, Thumbnail Conversion
(Processing), Thumbnail Storage, Serving Thumbnails (User Experience).
So, here was our decision. We decided that if we need to handle more thumbnail
processing, it's got to become more scalable. However, we didn't choose to simply
throw more CPU and network at a single server, because it would still be a single point
of failure. Instead, we decided to add a load balancer and scale out the thumbnail
servers. The great thing is that it's like a microservice in itself now. Because storage
has been isolated to Google Cloud Storage, the same code can be distributed and it
doesn't keep track of a queue or anything else. The upload server basically pulls
whatever is on the data storage server, and the load balancer distributes the requests
as they come in. Technically, this is probably an internal load balancer, but we'll get
into that a little bit later. In this case here, to help us with our greater than 80
percent CPU utilization, we want to distribute traffic requests from our business logic
to the application servers in a cluster.
Objectives and Indicators
Objective: Availability, 23/24 hours/day = 95.83% availability
Indicator: Server up/down time
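The availability figure on the slide is simple arithmetic, reproduced here as a quick check. The function and variable names are illustrative only.

```python
def availability_percent(up_hours: float, total_hours: float) -> float:
    """Availability as a percentage of the total period."""
    return 100.0 * up_hours / total_hours


slo_percent = availability_percent(23, 24)
allowed_downtime_minutes_per_day = (24 - 23) * 60

print(f"Availability objective: {slo_percent:.2f}%")
print(f"Allowed downtime: {allowed_downtime_minutes_per_day} minutes per day")
```

An objective of 23 out of 24 hours therefore leaves a budget of one hour of downtime per day.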
Even though we’ve added a cluster of servers, we haven’t changed anything that the
user can measure. The performance is still a measure of the end to end latency, and
the accuracy of the service is still based on the error logs.
We didn’t need to adjust the SLOs because they are not based on the CPU load of
the backend. Rather, the SLOs are based on the user experience.
The autoscaling will help alleviate the CPU bottleneck because if the pool gets
saturated it will autoscale.
That problem has been resolved. A new problem it reveals, however, is how long it
will take for the autoscaling to catch up to the user demand. If the user demand is
gradual, autoscaling will have no problem keeping up with demand. But if the demand
is extremely bursty, other techniques and settings might be necessary: for example,
keeping capacity at N+1 servers to give the pool time to start up another server,
changing the autoscaling trigger value, or using more sophisticated or custom metrics.
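The interaction between the trigger value and N+1 headroom can be sketched with a simple proportional sizing rule. This is an illustrative model, not the autoscaler's actual algorithm: the proportional formula, the function name, and the sample numbers are all assumptions.

```python
def desired_servers(current_servers: int, cpu_percent: int,
                    target_percent: int, spare: int = 1) -> int:
    """Servers needed to bring average CPU back to the target, plus headroom.

    Uses integer ceiling division so the result never rounds down.
    """
    needed = -(-(current_servers * cpu_percent) // target_percent)
    # `spare` is the N+1 headroom that absorbs a burst while a new server starts.
    return max(needed, 1) + spare


# 4 servers running at 90% CPU with a 60% trigger value:
print(desired_servers(4, 90, 60))
# Lowering the trigger value makes the pool scale out earlier:
print(desired_servers(4, 90, 45))
```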
YOUR TURN
Design challenge #3
Growth
App logs are growing. Logging server can't keep up
Figure: logs from the App servers, the Web server, and Data Storage feeding the logging
server.
If you expect to quickly outgrow local CPU, what is a way to scale the processing
capability of the logic?
Problem: Autoscaling of the application servers has produced logs that are
outgrowing the processing capacity of the aggregation logging server.
Design a solution.
One solution
Figure: one solution. Labels: App logs, Web logs, Data Storage logs, Cloud Load
Balancing, Logging Server, Storage Service.
GCP lab
Lab 3: How to move from an instance to an instance template, add an instance group, autoscaling,
and a load balancer. (Echo application).
Lab Deployment
autoscaling
instance group