Patch Management Automation For Enterprise Cloud
Ashu Gupta, Madhu Sudhan KN, Fazal Feroze, Rajesh Garg, Sumithra Ravichandran
IBM Global Technology Services
Abstract
Applying patches to operating systems, middleware, and applications is considered a major IT pain
point for several reasons: operating systems and software come in myriad types; updates,
operating systems, and applications are interdependent; enterprise customers lack
standardization; and testing applications and operating systems after an update is challenging.
As a result, a human operator is involved in different stages of the patching process, making it
costly and cumbersome. Cloud can help standardize various offerings to customers and potentially
remove human operators. However, it introduces other challenges, such as VM time zones and
restoring VMs from snapshots, which are not present in traditional enterprise environments. We
discuss the challenges of achieving patch automation in a Cloud, and then describe our solution.
What is patching?
• Software patches
– Security: Fix vulnerabilities
– Functional: Add features, improve existing functions, change software behaviors
– Impacted systems: Hypervisor, OS, middleware, applications
• Vendor Tools: Windows Server Update Services, Red Hat Network, etc.
– Only handle Windows/Red Hat systems
– Standalone tools with no integration with other management tools, e.g., change management
• 3rd Party Tools: IBM TEM, VMware vSphere Update Manager
– TEM: Hypervisor, guests, middleware, apps. vSphere: hypervisor & guests
– Scheduling & deployment made easy, but they largely rely on human involvement and have no
tool integration
• Amazon EC2, Microsoft Azure, many private Clouds
– User manages patching. VMs can be potentially vulnerable when first provisioned
• Rackspace
– Patching is a part of managed service ($0.12 per hour, $100 per month)
Section 1: Introduction
Operating systems, middleware, and applications need to be regularly patched to guard against
newly found vulnerabilities or to provide additional functionality. In the non-enterprise space, updates
are typically handled by turning on the auto-update feature of the operating system or middleware,
which can apply the patches as they are released by vendors. However, these mechanisms for
automatically updating are problematic in an enterprise environment. First, IT administrators need to
assess the impact of newly released patches before rolling them out to their IT infrastructure. Such
pre-assessment is clearly not possible using the automatic update feature. Second, IT administrators
need to have a consistent view of their IT infrastructure, including the vulnerability assessment, which
cannot be achieved through the separate automatic update mechanisms of individual ISVs (independent
software vendors). Lastly, IT administrators need to assess the post-patch impact, including failures
of existing applications running on their IT infrastructure. All these changes should be recorded in a
change management system for audit and failure-recovery purposes, which requires integration with
other management tools.
Section 1.1: Patch Management Tools
Broadly, there are two types of patch management tools: ISV-specific tools and third-party
tools. As discussed in the Introduction, each ISV has its own mechanism for
updating its software. However, by definition, those mechanisms are vendor specific. Examples of
these tools include Windows Server Update Services (WSUS) [WSUS’11] and Red Hat Network
(RHN) [RHN’11]. Third-party tools include Tivoli Endpoint Manager (formerly BigFix) [TEM’11] and
VMware vCenter Update Manager [VMwareUpdate’11]. Although these tools make the scheduling
and deployment of patches relatively easy, they still rely on humans to schedule the patches.
Moreover, these tools are not integrated with other management tools such as asset databases,
change management databases, and failure recovery databases. As a result, they require additional
human involvement during the end-to-end patch management process.
Among the public Cloud providers, EC2 and Azure VMs leave it to the customer to patch or update
the virtual machines. As a result, newly provisioned VMs can be vulnerable from the moment they start.
Rackspace and Microsoft Azure (web and worker roles) provide patching of VMs as part of managed
services, but there is a significant cost associated with it. For example, Rackspace offers the
managed services at $100 per VM per month. It is unknown whether the patch management process
in Rackspace and Azure is fully automated or partially manual.
Patch management in enterprise
• Patch management challenges in traditional enterprise
– Lack of standardization in software/hardware/services
• Human intervention in every step (notification, approval, scheduling, deployment, post-
deployment)
– Every customer wants a different patch management policy to suit their needs
• Patch management solution provided to one customer can be completely different from
another
• Little or no solution reuse
– Labor intensive and cost ineffective
Patch management in Cloud
• Cloud opportunities
– Highly standardized software/hardware stacks
• Aggressively increase the level of services automation in many processes of IT delivery,
including patch management
– A clean slate to offer customers more standardized solutions and services at
a lower cost
• Lower provider operational costs are passed on to customers, who are therefore more likely to
accept standardized services such as patch management
• Cloud challenges
– Extremely large scale, multi-tenancy, and a deeper vertical stack due to virtualization
Patch management process workflow
[Figure: patch management process workflow in three phases: patch notification, patch scheduling, and patch deployment and post-deployment. Participants include the vendor, patch advisory, system admin, application, change management, deployment tools, and target system. Messages shown: security and vulnerability notification; initiate change and approve change; request for middleware/app shutdown and shutdown completed; OS health check and OS health check completed; request for middleware/app health check and health check result.]
• Solution standardization
– Customer accepts one of the standardized change windows and patch management policies at
contract time
– No more waiting for customer approval for change requests
• Tools integration
– Enterprise system management tools have well-defined interfaces and can often be interfaced
programmatically
– Integration is critical to streamlining the end-to-end patch management workflow
An automated solution first requires the process to be standardized. A major hurdle that has
prevented patch management from being automated in existing managed environments is the lack of
services standardization, e.g., a patch management solution tailored to one customer often cannot be
used or easily adopted by another customer. Secondly, patch management is not simply using a
patch tool to apply patches to endpoint systems, but rather, a collaboration of multiple management
tools and teams, e.g., change management and patch advisory tools. Thirdly, in a large enterprise
environment, patch tools need to manage a large number of managed entities in a
scalable way while handling the heterogeneity that is unavoidable in any large environment.
Lastly, to avoid problems caused by automatically applying patches to endpoints, thorough testing of
patches beforehand is mandatory. Post-deployment health checking, however, can be
done lazily, because a plethora of monitoring tools at the platform, middleware, and application
layers, together with any ad hoc health checking procedures, help detect patch-related
problems.
Standardize patch management services
• Patch severities (examples) – tradeoffs between timeliness and downtime
– Low/Medium severity patches to be applied every 3 months
– High/Critical severity patches to be applied every month
[Figure: patch testing timeline]
We implement the means to deliver a standardized patching solution based on a set of
policies agreed with the customer. We leave the actual policy definition to the customer and will
honor the policy as long as it is within our pre-defined guidelines.
Some parameters that customers can use to define a patch scheduling policy are patch severity and
virtual machine category. Patching causes downtime, and a customer can use patch severity to
balance the amount of downtime due to patching against the benefits of applying a patch. For
example, low and medium severity patches can be applied every 3 months, while high severity
patches can be applied every month. Orthogonal to patch severity, a customer can also put
virtual machines into different categories, e.g., Test, Development, and Production. A patch is
scheduled to VMs of different categories in a staged fashion to minimize the chance that a
problematic patch is propagated to production machines.
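The policy described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the intervals come from the examples in the text, while the stage order, the 7-day gap between stages, and the function names `add_months` and `staged_schedule` are hypothetical and not part of the described system.

```python
from datetime import date, timedelta

# Intervals taken from the examples in the text: low/medium every 3 months,
# high/critical every month. Real values are agreed with each customer.
INTERVAL_MONTHS = {"low": 3, "medium": 3, "high": 1, "critical": 1}

# Assumption: stage order and the gap between stages are illustrative only.
STAGE_ORDER = ["Test", "Development", "Production"]
STAGE_GAP_DAYS = 7


def add_months(d: date, months: int) -> date:
    """Return the first day of the month that is `months` after d's month."""
    index = d.year * 12 + (d.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def staged_schedule(severity: str, last_cycle: date) -> dict:
    """Map each VM category to the date its next patch window opens.

    Earlier stages are patched first, so a problematic patch is likely to
    surface before it reaches Production.
    """
    start = add_months(last_cycle, INTERVAL_MONTHS[severity])
    return {
        category: start + timedelta(days=i * STAGE_GAP_DAYS)
        for i, category in enumerate(STAGE_ORDER)
    }


if __name__ == "__main__":
    for category, when in staged_schedule("high", date(2011, 8, 15)).items():
        print(category, when.isoformat())
```

Keeping severity (how often) separate from category (in what order) mirrors the text's point that the two parameters are orthogonal.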
Tool automation: TEM architecture and APIs
• Tivoli Endpoint Manager (TEM)
– Hierarchical infrastructure: Server, relay, clients
– Management through a console
• Core technologies
– Fixlets – action scripts to test patch applicability
and to apply patch
• The script language is flexible and fixlets can
be used beyond patch for other automation
tasks
• Hierarchical broadcast network overlay for
content distribution and data collection
– Programmatic APIs
– Implements Platform APIs and SOAP APIs to
allow external entities to control its action
programmatically
– Extensions needed to fully benefit Cloud
– Fully leverage automation enabled through Cloud
– Integrate with other management tools
Tool integration
[Figure: tool integration. The PAE, with its PAE DB and PAE UI, connects vendor advisory, patch advisory management, change management, patch deployment, and health check. Operations shown: open/update/close CR; open/update IR; patch->fixlets; test fixlet applicability; schedule/cancel fixlets; update/close/markNA.]
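The operations named in the tool integration figure (open/update/close CR, test fixlet applicability, schedule fixlets, mark not applicable) can be strung together as in the following sketch. Everything here is a hypothetical in-memory stand-in: `ChangeMgmt`, `run_patch_cycle`, and the `is_applicable` and `schedule` callbacks are illustrative names and do not correspond to any real TEM or change management API.

```python
class ChangeMgmt:
    """Hypothetical in-memory stand-in for a change management system."""

    def __init__(self):
        self.records = {}

    def open_cr(self, summary):
        cr_id = len(self.records) + 1
        self.records[cr_id] = {"summary": summary, "state": "open", "log": []}
        return cr_id

    def update_cr(self, cr_id, note):
        self.records[cr_id]["log"].append(note)

    def close_cr(self, cr_id):
        self.records[cr_id]["state"] = "closed"


def run_patch_cycle(cm, patch_id, targets, is_applicable, schedule):
    """Open a change record, schedule the patch where applicable, close it.

    `is_applicable(target, patch_id)` and `schedule(target, patch_id)` are
    caller-supplied callbacks standing in for the patch deployment tool.
    """
    cr_id = cm.open_cr(f"Apply {patch_id}")
    applicable = [t for t in targets if is_applicable(t, patch_id)]
    for target in applicable:
        schedule(target, patch_id)
        cm.update_cr(cr_id, f"scheduled {patch_id} on {target}")
    if not applicable:
        cm.update_cr(cr_id, "marked not applicable")
    cm.close_cr(cr_id)
    return cr_id, applicable


if __name__ == "__main__":
    cm = ChangeMgmt()
    scheduled = []
    cr, hit = run_patch_cycle(
        cm, "PATCH-1", ["vm-a", "vm-b"],
        is_applicable=lambda t, p: t == "vm-a",
        schedule=lambda t, p: scheduled.append((t, p)),
    )
    print(cr, hit, cm.records[cr]["state"])
```

Passing the deployment tool in as callbacks keeps the workflow independent of any one vendor interface, which is the property the integration layer needs.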
Section 4: Discussion
In this section, we discuss some more complex scenarios related to patching that are Cloud specific
or need to be more carefully handled in a Cloud environment.
Section 4.1: Multiple Time Zones
In a Cloud environment, virtual machines that are physically located in the same time zone can be
configured to operate in different time zones. When a customer’s VMs span multiple time
zones, patches must be scheduled carefully so that the intended behavior is achieved.
For some patches, the correct behavior is to apply them at the same local time on each virtual
machine, e.g., applying MS10-081 from Microsoft to all Windows machines at 11pm in their
respective local time. For other patches, the correct behavior is to apply them at the same absolute
time to avoid a mixed-mode problem, in which multiple versions of the same software run concurrently
and can corrupt data. TEM supports both methods of scheduling patches, and it is up to the customer
to specify the intended behavior. The default behavior is to use local time.
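The difference between the two scheduling modes can be made concrete with a short sketch. It is an illustration only and does not use TEM's scheduling interface: the VM names and offsets are hypothetical, and time zones are simplified to fixed UTC offsets, so daylight saving time is ignored.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical inventory: VM name -> configured UTC offset in hours.
VM_UTC_OFFSETS = {"vm-nyc": -5, "vm-london": 0, "vm-tokyo": 9}


def local_time_schedule(offsets, wall_clock):
    """Same local time on every VM: return the UTC instant for each VM.

    `wall_clock` is a naive datetime, read as local time on each VM.
    """
    result = {}
    for vm, hours in offsets.items():
        tz = timezone(timedelta(hours=hours))
        result[vm] = wall_clock.replace(tzinfo=tz).astimezone(timezone.utc)
    return result


def absolute_time_schedule(offsets, instant_utc):
    """Same instant on every VM: return each VM's local wall-clock time."""
    return {
        vm: instant_utc.astimezone(timezone(timedelta(hours=hours)))
        for vm, hours in offsets.items()
    }


if __name__ == "__main__":
    # 11pm local time on each VM maps to three different instants.
    for vm, t in local_time_schedule(VM_UTC_OFFSETS, datetime(2011, 8, 1, 23, 0)).items():
        print("local mode   ", vm, t.isoformat())
    # One instant maps to three different local clock readings.
    instant = datetime(2011, 8, 2, 4, 0, tzinfo=timezone.utc)
    for vm, t in absolute_time_schedule(VM_UTC_OFFSETS, instant).items():
        print("absolute mode", vm, t.isoformat())
```

In local mode the VMs are patched at different instants, which is where the mixed-mode window comes from. In absolute mode every VM changes version at the same instant.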
Section 4.2: VM Suspension and Snapshot
In a virtualized environment, additional modes of operation are available to system
administrators and users, such as VM suspend and resume, and snapshot and revert. The
management console that exposes these operations to users needs to be tightly integrated with patch
management and compliance processes; otherwise, a VM could become noncompliant
unexpectedly. For example, before a VM is suspended, it should have been patched to the latest
patch level using the automated patch management process we have described previously. When it
is resumed after an extended amount of time, it will most likely be in a noncompliant state with
missing patches. Therefore, it is important that the patch management system brings it up to the
latest patch level before handing the VM over to the user’s control. Likewise, when a VM is reverted
to an earlier snapshot, it must be re-baselined to the latest patch level.
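The catch-up step on resume or revert can be sketched as a gate that runs before the VM is released. This is a minimal illustration: the `VM` class, `missing_patches`, `release_to_user`, and the `apply_patch` callback are hypothetical names, and a real system would query the patch tool for compliance state rather than keep it in memory.

```python
from dataclasses import dataclass, field


@dataclass
class VM:
    """Hypothetical record of a VM and the patch IDs applied to it."""
    name: str
    applied: set = field(default_factory=set)


def missing_patches(vm, required):
    """Return the required patch IDs the VM does not have, in sorted order."""
    return sorted(set(required) - vm.applied)


def release_to_user(vm, required, apply_patch):
    """Catch the VM up to the baseline, then report whether it is compliant.

    `apply_patch(vm, patch_id)` is a caller-supplied callback standing in for
    the patch deployment tool. The VM should be handed over only when this
    function returns True.
    """
    for patch_id in missing_patches(vm, required):
        apply_patch(vm, patch_id)
    return not missing_patches(vm, required)


if __name__ == "__main__":
    # A VM resumed after a long suspension, missing two baseline patches.
    vm = VM("vm-resumed", applied={"P1"})
    ok = release_to_user(vm, ["P1", "P2", "P3"], lambda v, p: v.applied.add(p))
    print(ok, sorted(vm.applied))
```

The same gate applies to a snapshot revert: the reverted VM simply starts with an older `applied` set.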
Related work
• Live Patching
– OS [Lowell’04] [Potter’05] [Arnold’09]
– Application [Tewksbury’01]
• Offline patching
– VM image [Zhou’10]
• Future work
– Automated post-patch testing using runtime signatures
[References]
[Lowell’04] D. E. Lowell, Y. Saito, and E. J. Samberg. Devirtualizable virtual machines enabling general single-node, online maintenance. In
Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems. Boston, MA,
USA, 2004.
[Potter’05] S. Potter and J. Nieh. Reducing downtime due to system maintenance and upgrades. In Proceedings of the 19th Conference on
Large Installation System Administration Conference. San Diego, CA, 2005.
[Zhou’10] W. Zhou, P. Ning, X. Zhang, G. Ammons, R. Wang, V. Bala. Always up-to-date: scalable offline patching of VM images in a compute
Cloud. In Proceedings of the 26th Annual Computer Security Applications Conference. Austin, TX, 2010.
[Tewksbury’01] L. Tewksbury, L. Moser, and M. Melliar-Smith. Live upgrades of CORBA applications using object replication. In International
Conference on Software Maintenance. Florence, Italy, Nov, 2001.
[Segal’89] M. E. Segal and O. Frieder. Dynamically updating distributed software: supporting change in uncertain and mistrustful environments.
In IEEE Conference on Software Maintenance, Oct, 1989.
[Choi’09] Online application upgrade using edition-based redefinition. In ACM Workshop on Hot Topics in Software Upgrades, Oct, 2009.
[Crameri’07] O. Crameri, N. Knezvic, D. Kostic, R. Bianchini, and W. Zwaenepoel. Staged deployment in Mirage: an integrated software
upgrade testing and distribution system. In Symposium on Operating Systems Principles, Oct, 2007.
[Dumitras’10] T. Dumitras, P. Narasimhan, E. Tilevich. To Upgrade or Not to Upgrade: Impact of Online Upgrades across Multiple
Administrative Domains. Onward, Oct, 2010.
[Arnold’09] J. Arnold, M. F. Kaashoek. Ksplice: Automatic Rebootless Kernel Updates. EuroSys, 2009.
[WSUS’11] Windows Server Update Services (URL), accessed August, 2011. http://technet.microsoft.com/en-us/windowsserver/bb332157
[RHN’11] Red Hat Network (RHN), accessed August, 2011. https://www.redhat.com/red_hat_network/
[TEM’11] IBM Tivoli Endpoint Manager (URL), accessed August, 2011. https://www-01.ibm.com/software/tivoli/solutions/endpoint/?s_pkg=bfwm
[VMwareUpdate’11] VMware Update Manager (URL), accessed August, 2011. http://www.vmware.com/products/update-manager/overview.html