VMMIG - Module05 - Optimize Phase
Optimize costs
Agenda
Introduction
Image strategies and configuration management
Managed Instance Groups
Availability and disaster recovery
Networking and security consolidation
Managed services
Cost optimization
Migrate for Anthos (VMs to containers)
The Optimize phase is where it gets cloudy
Having moved your workloads, you can now update to fully exploit the cloud.
https://cloud.google.com/solutions/image-management-best-practices
Start with a base OS installation, or if building images for GCP, start with a public boot
image
Periodically, take the base image and harden it: remove services, change settings,
install security components, and so on. Build the hardened image every 90 days, or
at whatever frequency makes sense for the organization. This becomes the basis of
subsequent builds.
More frequently, build platform-specific images. One image for web servers, one for
application servers, one for databases, etc. Build this image maybe every 30 days.
As frequently as you build an app, create new VM images for the new versions of the
application. You might create new application images on a daily basis.
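As a hedged illustration of this layering (all image, family, and project names below are hypothetical, not from the course), Compute Engine image families let each layer publish new builds while consumers simply track the family:

```shell
# 1. Hardened base image, rebuilt ~every 90 days from a public boot image.
gcloud compute images create hardened-base-20240101 \
    --source-image=debian-11-bullseye-v20231212 \
    --source-image-project=debian-cloud \
    --family=hardened-base
# 2. Platform images (~every 30 days) and app images (daily) would be baked
#    on top with a tool like Packer, each published into its own family,
#    e.g. --family=web-platform, --family=web-app.
# 3. Instance templates reference the family, so new builds are picked up
#    automatically the next time instances are created:
gcloud compute instance-templates create web-app-tpl-20240102 \
    --image-family=web-app --image-project=my-image-project
```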
Goals for image management
(Diagram: bespoke Servers 2-4, each a uniquely configured Compute Engine VM,
converge onto shared, versioned boot images, v1.0 and v1.1.)
After migration, you have servers with independent configurations. They may, or may
not, be managed with a configuration management solution. However, each is
managed as a unique asset.
By updating the servers to all use a consistent base image, you ensure uniform
configuration across multiple instances. You also make it possible to combine like
servers into managed instance groups. This provides benefits such as:
- Health checks
- Ability to resize the cluster easily
- Autoscaling (for workloads that will scale horizontally)
- A cloud-native approach to VM updates - that is, use of immutable images.
This, combined with the rolling update feature of MIGs, makes rolling out new
versions easy.
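For example, a rolling update against a MIG might look like the following sketch (group, template, and zone names are placeholders):

```shell
# Roll a new immutable image version out to a managed instance group,
# surging up to 3 extra instances and keeping all existing capacity serving.
gcloud compute instance-groups managed rolling-action start-update web-mig \
    --version=template=web-app-tpl-v2 \
    --max-surge=3 --max-unavailable=0 \
    --zone=us-central1-a
```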
Organizational maturity
Robust systems will be run much like a standard DevOps pipeline. Commits to a code
base will trigger build jobs, which will create/test/deploy images. The image building
tool can leverage configuration management systems to automate the configuration of
the image.
Many customers will have some version of the second option, with a set of images
that may be built manually or with partial automation. They don't get built as often,
and certainly not daily.
Some customers will have hand-crafted servers, and have no existing process in
place for creating/baking images.
GCP Images
https://cloud.google.com/solutions/image-management-best-practices
There are three main approaches to creating GCP boot images that can be used for
managed instance groups.
Best: Build an image from the ground up, starting with a public image. Develop a
clean CI/CD pipeline for generating these images, using tools like Packer and
Chef/Puppet/Ansible.
Good: Use existing image-generation pipelines and produce output images for GCP.
Tools like Packer and Vagrant that are being used to produce VMware images can
also output these images for use with GCP.
Not-so-good (some would say bad): Take the migrated VM's disk and create an image,
then manually prune and tailor that image.
How much configuration is baked in?
https://cloud.google.com/solutions/image-management-best-practices
There are many variables that go into deciding how much you bake into an image.
https://cloud.google.com/community/tutorials/create-cloud-build-image-factory-using-packer
You will need to make a few tweaks to the demo to get it to work:
The image creation process takes about 5 minutes. While waiting, take time to explain
how Cloud Build works.
Packer can't SSH in successfully on instances where OS Login is enabled. Make sure
the metadata enabling this feature is not set on the project where you are demoing.
Optimizing for configuration management
(Diagram: an on-premises CM server manages the on-prem VMs, while a second CM
server on Compute Engine manages the migrated VMs; routes, firewall rules, and
bandwidth connect the two environments. In both environments, playbooks disable
network, authorization, and firewall management.)
https://cloud.google.com/solutions/configuration-management/
As noted in the module on the Plan phase, companies should have configuration
management for their on-premises assets in place before migrating VMs into the
cloud.
When extending infrastructure into the cloud, one common approach is to place CM
servers in the cloud as well. You configure the on-prem servers to manage the
on-prem inventory, and the cloud servers to manage the cloud inventory. You then
have either separate playbooks for the different environments, or adaptable playbooks
that use environment-specific variables or context to perform slightly different
configuration depending on whether the VM is in the cloud or on-prem.
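One hedged way to express this in a CM tool, sketched here with Ansible (the `env` variable and role names are assumptions, not from the course):

```yaml
# One playbook, environment-specific behavior via a variable.
- hosts: all
  tasks:
    - name: Configure OS firewall (on-prem only; GCP uses VPC firewall rules)
      ansible.builtin.include_role:
        name: local_firewall
      when: env == "onprem"

    - name: Apply common hardening in both environments
      ansible.builtin.include_role:
        name: hardening
```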
For VMs migrated into GCP, you'll want to remove the normal CM commands that
configure network, firewall, and authorization settings as they will be managed
differently in GCP.
Agenda
Introduction
Image strategies and configuration management
Managed Instance Groups
Availability and disaster recovery
Networking and security consolidation
Managed services
Cost optimization
Migrate for Anthos (VMs to containers)
Optimizing for scaling and release management
https://cloud.google.com/compute/docs/instance-groups/rolling-out-updates-to-managed-instance-groups
https://cloud.google.com/docs/enterprise/best-practices-for-enterprise-organizations#high-availability
https://cloud.google.com/docs/geography-and-regions
Regional instance groups distribute instances created from your template across
zones. If a zone goes down, the instances in other zones remain available, but the
regional MIG will not automatically create additional instances in the remaining
zones (unless autoscaling is enabled). An alternative is to use multiple zonal
managed instance groups.
Not mentioned on slide, but also important, is ensuring you have a high-availability
interconnect between GCP and your on-premises networks. This should have been
handled during the Plan phase.
Optimizing for disaster recovery
● The lower the tolerance for loss, the higher the cost and complexity
● Options include…
○ Cold: rebuild app in another region
○ Warm: unused app in another region
○ Hot: app runs across regions
https://cloud.google.com/solutions/dr-scenarios-planning-guide
DR: Cold pattern
(Diagram: Deployment Manager definitions and database backups in a multi-regional
Cloud Storage bucket allow the app, on Compute Engine, and its Cloud SQL database
to be recreated in Region 2 if Region 1 fails.)
https://cloud.google.com/solutions/dr-scenarios-planning-guide
The original environment is deployed using Infrastructure as Code (IaC). The app is
implemented using a managed instance group and instance templates. The database
is backed up periodically to a multiregional bucket.
If a region fails, the application can be redeployed fairly quickly into a new region,
the database can be restored from the latest backup, and the load balancer can be
reconfigured with a new backend service.
RTO is typically bounded by the time required to restore the DB. RPO is bounded by
how frequently you perform database backups.
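A cold recovery might be driven by commands along these lines (a sketch with placeholder names, assuming Deployment Manager for IaC and Cloud SQL for the database):

```shell
# Recreate the app stack in the failover region from the IaC definition.
gcloud deployment-manager deployments create app-dr --config app.yaml
# Restore the database from the most recent backup into the new instance.
gcloud sql backups restore BACKUP_ID --restore-instance=app-db-dr
# Finally, point the load balancer's backend service at the new MIG.
```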
DR: Warm pattern
(Diagram: the app runs on Compute Engine in both regions; the Cloud SQL database
in Region 1 replicates to a Cloud SQL replica in Region 2.)
https://cloud.google.com/solutions/dr-scenarios-planning-guide
App deployments are made into multiple regions, but failover regions have smaller
application MIGs which don't serve traffic. A DB replica is created in the failover
region; it receives replication traffic from the DB primary, keeping it very nearly
up to date.
In the case of failure, update the load balancer to include the Region 2 instance
group as a backend, increase the size of the instance group, and point the app at the
replica (via DNS changes, or by placing a load balancer in front of the DB and
changing its configuration).
This design reduces the RTO and RPO significantly. However, it does introduce
cross-regional replication costs.
DR: Hot pattern
(Diagram: the app runs on Compute Engine in both regions, backed by a single
Cloud Spanner database.)
https://cloud.google.com/solutions/dr-scenarios-planning-guide
App deployment occurs in multiple regions. The load balancer does geo-aware
routing of requests to the nearest region. The backing database service, Spanner,
handles replication across regions.
https://www.youtube.com/watch?v=1ibeCQjjpBw&autoplay=1
https://forsetisecurity.org/about/
https://cloud.google.com/vpc/docs/firewalls#service-accounts-vs-tags
https://cloud.google.com/blog/products/gcp/simplify-cloud-vpc-firewall-management-with-service-accounts
Optimizing load balancing
● Proxy-based load balancers offer cross-regional routing
● Hybrid load balancing can be DNS-based
https://cloud.google.com/load-balancing/
Optimizing security at the edge
https://cloud.google.com/files/GCPDDoSprotection-04122016.pdf
https://cloud.google.com/armor/
Optimizing secret management
https://cloud.google.com/compute/docs/access/create-enable-service-accounts-for-instances
https://cloud.google.com/kms/
https://cloud.google.com/hsm/
https://cloud.google.com/kms/docs/encrypt-decrypt
Apps running on an instance can use the VM-assigned service account via the Cloud
Client Libraries; credentials are provided to the application via the metadata server.
Apps can leverage Cloud KMS to decrypt secrets that are either stored in app
configuration or in GCS. The application operates within the context of a service
account. That account has permissions to use a given key. That key is used by Cloud
KMS to encrypt/decrypt secrets. The diagram shows a secret being stored in GCS,
and the app asks KMS to read and decrypt the file.
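That flow might look like the following sketch (the bucket, keyring, and key names are placeholders):

```shell
# Fetch the encrypted secret from GCS; the VM's service account must have
# read access to the bucket and kms.cryptoKeyDecrypter on the key.
gsutil cp gs://my-secrets-bucket/db-password.enc /tmp/db-password.enc
# Ask Cloud KMS to decrypt it locally.
gcloud kms decrypt \
    --location=global --keyring=app-keyring --key=app-secrets-key \
    --ciphertext-file=/tmp/db-password.enc \
    --plaintext-file=/tmp/db-password
```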
Optimizing IAM configurations
https://cloud.google.com/policy-intelligence/
Logging and monitoring for security
● Google services write logging and metrics data into Cloud Logging
○ VMs with agents installed write guest OS and application data as well
○ This data can flow through to your logging/monitoring tools of choice
https://cloud.google.com/logging/docs/export/
https://cloud.google.com/solutions/exporting-stackdriver-logging-for-splunk
https://www.splunk.com/blog/2016/03/23/announcing-splunk-add-on-for-google-cloud-platform-gcp-at-gcpnext16.html
https://resources.netskope.com/cloud-security-collateral-2/netskope-for-google-cloud-platform
https://help.sumologic.com/03Send-Data/Sources/02Sources-for-Hosted-Collectors/Google-Cloud-Platform-Source
https://cloud.google.com/logging/docs/audit/
https://cloud.google.com/vpc/docs/using-flow-logs
https://cloud.google.com/vpc/docs/firewall-rules-logging
Agenda
Introduction
Image strategies and configuration management
Managed Instance Groups
Availability and disaster recovery
Networking and security consolidation
Managed services
Cost optimization
Migrate for Anthos (VMs to containers)
Optimizing with managed services
Cloud SQL for MySQL: consider cost; instances are not on your VPC; see the
differences and known-issues docs.
https://cloud.google.com/sql/docs/mysql/features#differences
https://cloud.google.com/sql/faq
Cloud SQL costs roughly 2x the cost of unmanaged MySQL running on a VM. Cloud
SQL VMs are not on a VPC in your project; they are accessed via peering or public IP.
https://cloud.google.com/pubsub/architecture
https://cloud.google.com/pubsub/docs/faq
https://cloud.google.com/pubsub/docs/ordering
https://cloud.google.com/pubsub/pricing
https://cloud.google.com/dataproc/pricing
https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage
https://cloud.google.com/dataproc/docs/resources/faq
Dataproc's $0.01/vCPU/hour charge adds up on very large, long-lived clusters.
In general, the online documentation does a good job of detailing key issues. Review
the concepts section, the known issues section, and the pricing. Also, Googling <gcp
product> vs. <other product> often yields good initial results.
Agenda
Introduction
Image strategies and configuration management
Managed Instance Groups
Availability and disaster recovery
Networking and security consolidation
Managed services
Cost optimization
Migrate for Anthos (VMs to containers)
Make sure you tailor instance sizes to real needs
Sizing recommendations are currently not available for: VM instances created using
App Engine Flexible Environment, Cloud Dataflow, or Google Kubernetes Engine; or
VM instances with ephemeral disks, GPUs, or TPUs.
The sizing recommendation algorithm is less suited to workloads that spike less
frequently (for example, monthly spikes) because 8 days of data is not enough to
capture or predict the processing fluctuations.
https://cloud.google.com/compute/docs/instances/apply-sizing-recommendations-for-instances
There's more to TCO than VM costs
● Persistent Disk costs
● Network egress costs
● Intra-VPC traffic costs
● Load balancer costs
https://cloud.google.com/billing/docs/how-to/export-data-bigquery
BigQuery is hugely useful for analyzing billing data. It can be used to find large,
and potentially unexpected, sources of cost - which can then be optimized.
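For example, a hedged starting query against the standard billing export schema (the project, dataset, and table names are placeholders):

```sql
-- Top cost drivers by service and SKU from the BigQuery billing export.
SELECT service.description AS service,
       sku.description AS sku,
       ROUND(SUM(cost), 2) AS total_cost
FROM `my-project.billing.gcp_billing_export_v1_XXXXXX`
GROUP BY service, sku
ORDER BY total_cost DESC
LIMIT 20
```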
Watch network costs
Remember that traffic transferred within a VPC but across zones or regions incurs
costs (in addition to the more obvious egress from the VPC).
https://cloud.google.com/vpc/docs/vpc-peering
https://cloud.google.com/vpc/docs/shared-vpc
https://cloud.google.com/network-tiers/
Working with budgets
https://cloud.google.com/billing/docs/how-to/budgets
https://cloud.google.com/billing/docs/how-to/notify
https://cloud.google.com/bigquery/docs/custom-quotas
https://cloud.google.com/appengine/pricing#spending_limit
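Creating a budget with alert thresholds can be scripted; a sketch with placeholder account ID, name, and amounts:

```shell
# Create a budget that notifies at 50%, 90%, and 100% of the target spend.
gcloud billing budgets create \
    --billing-account=0X0X0X-0X0X0X-0X0X0X \
    --display-name="migration-project-budget" \
    --budget-amount=1000USD \
    --threshold-rule=percent=0.5 \
    --threshold-rule=percent=0.9 \
    --threshold-rule=percent=1.0
```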
Lab 13
Defining an optimization strategy
Agenda
Introduction
Image strategies and configuration management
Managed Instance Groups
Availability and disaster recovery
Networking and security consolidation
Managed services
Cost optimization
Migrate for Anthos (VMs to containers)
Moving VMs into containers
Why Kubernetes/GKE?
Secure kernel
Density
Resiliency
Modernization
● Google offers automatic upgrades, which keep the kernel on the machines
running your apps secure.
● You can run more apps on a given host for better resource utilization.
● If a node goes down, workloads are quickly rescheduled onto other nodes.
● Istio, for example, makes service discovery, traffic splitting, authorization,
circuit-breaker patterns, and other features easy to implement without
having to rewrite apps.
● Teams can get experience using GKE and Kubernetes without having to totally
re-engineer their apps.
Migrate for Anthos moves VMs to containers
Control
● Migration CRD, console, CLI
Workload Execution
● Linux -> Anthos runtime embedded in the container image, replacing the VM's kernel
● Windows -> generates a Dockerfile that builds a Windows Server container image
Maintenance
● Dockerfile can be used in a CI/CD pipeline that builds updated container images
- For GCP, AWS, and Azure sources, the processing cluster can be either a
GKE cluster or an Anthos GKE on Google Cloud cluster; for VMware sources,
you need an Anthos GKE on-prem cluster
- For Linux workloads, you can manage the entire process within the
console (rather than the old CLI installation and migration processes)
- For more details, see
https://cloud.google.com/migrate/anthos/docs/architecture
Migrate for Anthos Architecture
(Diagram: a Migrate for Anthos processing cluster, containing CRDs, a StorageClass,
and CSI drivers, generates Dockerfiles, YAML files, and container images for the
app in a production project.)
1. Migrate for Compute Engine creates the pipeline for streaming/migrating data
from on-prem or cloud systems into GCP.
2. Migrate for Anthos is installed on a GKE processing cluster, and is comprised
of many Kubernetes resources.
3. Migrate for Anthos is used to generate deployment artifacts
a. Some, like the Kubernetes configurations and the Dockerfile used to
create the VM-wrapping container, go into GCS
b. The container images themselves are stored in GCR
4. Once the deployment assets have been created, they can be used to test or
deploy into a target cluster
a. You simply apply the generated configuration, and it creates the
necessary Kubernetes elements on the target cluster
Keep in mind that the bottom half of the diagram doesn't apply when migrating from
GCE.
Migration works with specific OS and GKE versions
Supported workload OSes: Ubuntu, among others (see the link below)
Processing cluster GKE node OS: COS (ext2/3/4 filesystems)
https://cloud.google.com/migrate/anthos/docs/supported-os-versions
Migrations follow a typical path
1. Configure processing cluster: create the cluster and install the Migrate for
Anthos components.
2. Add migration source: contains details needed to migrate from VMware, AWS,
Azure, or GCP.
3. Generate and review plan: create a migration object with details of the
migration; customize the generated plan as needed.
4. Generate artifacts: generate container images and YAML files for deployment.
5. Test: test the container images and test deployment.
6. Deploy: use the generated artifacts to deploy to production clusters.
https://cloud.google.com/migrate/anthos/docs/migration-journey
Migrate for Anthos requires a processing cluster
https://cloud.google.com/migrate/anthos/docs/configuring-a-cluster
Using the Console to add a processing cluster provides a set of Cloud Shell
commands that are used to complete the installation of Migrate for Anthos.
You can run each command by clicking the RUN IN CLOUD SHELL buttons.
The first command enables all the services required for Migrate for Anthos to operate.
Configuring access
(Diagram: on the processing cluster in Kubernetes Engine, the installation creates
CRDs, roles/clusterroles, rolebindings/clusterrolebindings, daemonsets, configmaps,
a storageclass, and a job.)
● The gcloud command configures kubectl to work with the processing cluster
● The migctl command installs the CRDs and creates resources:
○ namespace/migrate-system created
○ namespace/v2k-system created
○ storageclass.storage.k8s.io/v2k-generic-disk created
○ customresourcedefinition.apiextensions.k8s.io/migrations.anthos-migra
te.cloud.google.com created
○ serviceaccount/csi-vlsdisk-csi-controller-sa created
○ serviceaccount/csi-vlsdisk-csi-node-sa created
○ serviceaccount/v2k-generic-csi-csi-controller-sa created
○ serviceaccount/v2k-generic-csi-csi-node-sa created
○ serviceaccount/v2k-reconciler-sa created
○ serviceaccount/validator-sa created
○ role.rbac.authorization.k8s.io/csi-vlsdisk-node-healthcheck-pods-role
created
○ role.rbac.authorization.k8s.io/v2k-leader-election-role created
○ clusterrole.rbac.authorization.k8s.io/csi-vlsdisk-controller-role-vls
created
○ clusterrole.rbac.authorization.k8s.io/csi-vlsdisk-driver-registrar-role
created
○ clusterrole.rbac.authorization.k8s.io/csi-vlsdisk-node-healthcheck-role
created
○ clusterrole.rbac.authorization.k8s.io/v2k-generic-csi-controller-role-vls
created
○ clusterrole.rbac.authorization.k8s.io/v2k-generic-csi-driver-registrar-rol
e created
○ clusterrole.rbac.authorization.k8s.io/v2k-manager-role created
○ clusterrole.rbac.authorization.k8s.io/v2k-proxy-role created
○ clusterrole.rbac.authorization.k8s.io/validator-role created
○ rolebinding.rbac.authorization.k8s.io/csi-vlsdisk-node-healthcheck-pod
s-binding created
○ rolebinding.rbac.authorization.k8s.io/v2k-leader-election-rolebinding
created
○ clusterrolebinding.rbac.authorization.k8s.io/csi-vlsdisk-controller-attach
er-binding created
○ clusterrolebinding.rbac.authorization.k8s.io/csi-vlsdisk-controller-provisi
oner-binding created
○ clusterrolebinding.rbac.authorization.k8s.io/csi-vlsdisk-controller-secret
-access-binding created
○ clusterrolebinding.rbac.authorization.k8s.io/csi-vlsdisk-driver-registar-bi
nding created
○ clusterrolebinding.rbac.authorization.k8s.io/csi-vlsdisk-node-healthchec
k-binding created
○ clusterrolebinding.rbac.authorization.k8s.io/v2k-generic-csi-controller-a
ttacher-binding created
○ clusterrolebinding.rbac.authorization.k8s.io/v2k-generic-csi-controller-p
rovisioner-binding created
○ clusterrolebinding.rbac.authorization.k8s.io/v2k-generic-csi-controller-s
ecret-access-binding created
○ clusterrolebinding.rbac.authorization.k8s.io/v2k-generic-csi-driver-regis
tar-binding created
○ clusterrolebinding.rbac.authorization.k8s.io/v2k-manager-rolebinding
created
○ clusterrolebinding.rbac.authorization.k8s.io/v2k-proxy-rolebinding
created
○ clusterrolebinding.rbac.authorization.k8s.io/validator-role-binding
created
○ configmap/default-logs created
○ configmap/exporter-default-config created
○ configmap/v2k-manager-config-map created
○ service/v2k-controller-manager-metrics-service created
○ service/v2k-controller-web-server created
○ deployment.apps/v2k-controller-manager created
○ statefulset.apps/csi-vlsdisk-controller created
○ statefulset.apps/v2k-generic-csi-controller created
○ daemonset.apps/csi-vlsdisk-node created
○ daemonset.apps/v2k-generic-csi-node created
○ job.batch/admission-validator created
This process takes a couple of minutes to complete, and can be checked using the
provided command: gcloud container clusters get-credentials
migrate-processing-cluster --zone us-central1-f --project
qwiklabs-gcp-01-8415e218f0d2 && migctl doctor
Adding a source enables migrations from a specific environment
You select the processing cluster, the source type, the target project for migrated
workloads, and a service account used to perform the migration (this can be created
for you as part of installation).
https://cloud.google.com/migrate/anthos/docs/creating-a-migration
Creating a migration generates a migration plan
You should review the migration plan file resulting from creating a migration, and
customize it before executing the migration. The details of your migration plan are
used to extract the workload container artifacts from the source VM, and to generate
Kubernetes deployment files that you can use to deploy the container image to other
clusters, such as a production cluster.
Generating artifacts
https://cloud.google.com/migrate/anthos/docs/executing-a-migration
● Migrate for Anthos creates two new Docker images: a runnable image for
deployment to another cluster and a non-runnable image layer that can be
used to update the container image in the future. See Customizing a migration
plan for information on how to identify these images.
● It also generates configuration YAML files that you can use to deploy the VM
to another GKE cluster. These are copied into a Cloud Storage bucket as an
intermediate location. You run migctl migration get-artifacts to download them.
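Putting those two steps together (the migration name is illustrative, and the artifact filename is assumed here to be the downloaded deployment spec):

```shell
# Download the generated deployment artifacts from GCS.
migctl migration get-artifacts demo-migration
# Review and edit the YAML, then apply it to the target cluster.
kubectl apply -f deployment_spec.yaml
```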
Deployment files typically need modification
migctl migration get-artifacts demo-migration
● The configuration defines resources to deploy:
○ Deployment or StatefulSet
○ Headless service
○ PersistentVolumes and PersistentVolumeClaims
https://cloud.google.com/migrate/anthos/docs/review-deployment-files
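A hedged sketch of the kinds of resources such a generated file contains (the names and image path are placeholders, not real Migrate for Anthos output):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: migrated-app
spec:
  replicas: 1
  selector:
    matchLabels: {app: migrated-app}
  template:
    metadata:
      labels: {app: migrated-app}
    spec:
      containers:
        - name: migrated-app
          image: gcr.io/my-project/migrated-app:v1
---
apiVersion: v1
kind: Service
metadata:
  name: migrated-app
spec:
  clusterIP: None        # headless service
  selector: {app: migrated-app}
```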
Apply the configuration to deploy the workload
https://cloud.google.com/migrate/anthos/docs/review-deployment-files
Best Practices
● Web/application servers
● Business logic layer
● High-performance databases
● In-memory databases
● Multi-VM, multi-tier
● DNS
● NFS
https://cloud.google.com/migrate/anthos/docs/planning-best-practices?authuser=0
DNS
- Replace host names with service names (and create services)
- Handling /etc/hosts files requires special effort
- Figure out how discovery will work, and if you need to place services in same
namespace
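As a sketch of the first point, a Kubernetes Service whose name matches the old hostname can take its place (all names here are hypothetical):

```yaml
# A Service named after the old hostname "db01", so consumers in the same
# namespace resolve it unchanged; callers elsewhere would use
# db01.legacy-app.svc.cluster.local.
apiVersion: v1
kind: Service
metadata:
  name: db01
  namespace: legacy-app
spec:
  selector: {app: db01}
  ports:
    - port: 3306
```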
NFS
- Must modify YAML files to mount NFS
- Can't run with kernel mode NFS servers
- Data from NFS mount is not automatically migrated
Environment variables
- If your applications rely on injected metadata (for example, environment
variables), you will need to ensure that these are available on GKE. If the
same metadata injection method is not available, GKE offers ConfigMaps and
Secrets.
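A hedged sketch of that substitution (names are placeholders): a ConfigMap holds the value, and the workload's container references it as an environment variable:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  APP_ENV: production
---
# Fragment of the workload's pod spec:
# containers:
#   - name: migrated-app
#     env:
#       - name: APP_ENV
#         valueFrom:
#           configMapKeyRef:
#             name: app-config
#             key: APP_ENV
```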
Runlevels
- Migrate for Anthos workloads reach runlevel 3 only: VMs migrated into GKE
are booted in the container at Linux runlevel 3.
- Certain services (for example X11 or XDM, for remote GUI access using VNC)
are configured by default to start only at runlevel 5. Any necessary services
should be configured to start at runlevel 3.