Operations Manager 2007 R2 Design Guide: Author
Microsoft Corporation
Published: September 2010
Author
Christopher Fox
The information contained in this document represents the current view of Microsoft Corporation
on the issues discussed as of the date of publication. Because Microsoft must respond to
changing market conditions, it should not be interpreted to be a commitment on the part of
Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the
date of publication.
This document is for informational purposes only. MICROSOFT MAKES NO WARRANTIES,
EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS DOCUMENT.
Complying with all applicable copyright laws is the responsibility of the user. Without limiting the
rights under copyright, no part of this document may be reproduced, stored in or introduced into a
retrieval system, or transmitted in any form or by any means (electronic, mechanical,
photocopying, recording, or otherwise), or for any purpose, without the express written permission
of Microsoft Corporation.
Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual
property rights covering subject matter in this document. Except as expressly provided in any
written license agreement from Microsoft, the furnishing of this document does not give you any
license to these patents, trademarks, copyrights, or other intellectual property.
Unless otherwise noted, the companies, organizations, products, domain names, e-mail
addresses, logos, people, places, and events depicted in examples herein are fictitious. No
association with any real company, organization, product, domain name, e-mail address, logo,
person, place, or event is intended or should be inferred.
© 2009 Microsoft Corporation. All rights reserved.
Microsoft, Active Directory, ActiveSync, Internet Explorer, JScript, SharePoint, SQL Server, Visio,
Visual Basic, Visual Studio, Win32, Windows, Windows PowerShell, Windows Server, and
Windows Vista are trademarks of the Microsoft group of companies.
All other trademarks are property of their respective owners.
Revision History
Release Date Changes
production. When you reach this point, the next guide to use is the Operations Manager 2007
Deployment Guide.
Please note that this guide is intended to do just as its name says, to guide you. The decisions
that you make and the design you come to in the end must ultimately be based on your needs.
The guide helps make sure that you have all the information you need to make the best decisions
for your particular situation.
OperationsManager Database
The OperationsManager database is the first component to be installed in all management
groups. This database holds all the configuration data for the management group and stores all
the monitoring data that has been collected and processed by the agents.
To optimize performance of Operations Manager, you must keep the size of the
OperationsManager database under control. Testing has shown that staying under 50 GB is a
good practice. To keep from exceeding this limit, Operations Manager 2007 will automatically
groom out older, unnecessary data according to parameters that you set.
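As a rough illustration of tracking the 50 GB guideline, the following Python sketch estimates how many days remain before the database would reach that size at a constant growth rate. The function name, the sample figures, and the linear-growth model are assumptions made for this example and are not part of Operations Manager; real growth depends on your grooming settings and monitoring load.

```python
import math

# Guideline size stated in this guide, in gigabytes.
RECOMMENDED_MAX_GB = 50.0


def days_until_limit(current_gb, daily_growth_gb, limit_gb=RECOMMENDED_MAX_GB):
    """Return the whole number of days until the database reaches limit_gb.

    Returns 0 if the limit has already been reached, and None if the
    database is not growing. Assumes constant (linear) daily growth,
    which is a simplification for illustration only.
    """
    if current_gb >= limit_gb:
        return 0
    if daily_growth_gb <= 0:
        return None
    return math.ceil((limit_gb - current_gb) / daily_growth_gb)


if __name__ == "__main__":
    # Hypothetical figures: a 38 GB database growing 0.5 GB per day.
    print(days_until_limit(current_gb=38.0, daily_growth_gb=0.5))
```

A result that is small or zero suggests revisiting the grooming parameters before the guideline is exceeded.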
Because only one OperationsManager database can be in a management group, it must be
functional for the management group to function. To keep the single OperationsManager
database from being a single point of failure, you can place it in a Cluster service (formerly
known as MSCS) failover cluster. In addition, you can configure log shipping so that current
operations data and configuration information is sent to another Microsoft SQL Server instance
of the same version that hosts a duplicate copy of the primary OperationsManager database. If
the primary database fails, the duplicate can be brought up to date and put into service. The
OperationsManager database is involved in these activities:
management pack import – Management pack imports place a load on the CPU, the memory,
and the disk of the database server.
discovery – As the discovery process occurs, agents return data to the management servers.
Ultimately, this data is inserted into the OperationsManager database. This process places a
load on the disk and on the CPU of the database server.
monitoring operations – All data that is collected from agents and all management group
configuration information is stored in the OperationsManager database.
maintenance of the Instance space – The System Center Management Configuration service
calculates the configurations for all monitored devices in the management group. To do this,
the service maintains a copy of all the configuration information in memory and performs its
calculations there. This places a load on memory. After the instance space calculations are
run, agents send a synchronization request to their management server, which sends the
request to the RMS. The RMS stores these requests in an in-memory queue until it can act
upon them.
discovery – After management packs are sent to the agents, the discovery process starts.
Agents return the discovery data to their management servers and then to the RMS. This
data is inserted into the OperationsManager database and incorporated in the Instance
space. Both activities place a load on the disk, the CPU, and the memory on the RMS.
Agent
An Operations Manager 2007 agent is a service that is deployed to a computer that you want to
monitor. On the monitored device, an agent is listed as the System Center Management service.
Every agent reports to a management server in the management group. This management server
is referred to as the agent's primary management server. Each agent watches data sources on the
monitored device and collects information according to the configuration that is sent to it from its
management server. The agent also calculates the health state of the monitored object and
reports back to the management server. When the health state of a monitored object changes or
other criteria are met, an alert can be generated from the agent. This lets operators know that
something has gone awry and requires attention.
Agents also have the ability to take many different types of action to help diagnose issues or
correct them. By feeding health data to the management server about the monitored device, the
agent provides an up-to-date picture of the health of the device and all the applications that it
hosts.
It is possible to monitor devices in an agentless fashion. In this case, a management server
performs the monitoring remotely.
Operations Console
The Operations console provides a single, unified user interface for interacting with Operations
Manager 2007. The Operations console provides access to monitoring data, basic management
pack authoring tools, Operations Manager 2007 reports, all the controls and tools necessary for
administering Operations Manager 2007, and a customizable workspace.
For a user to access the Operations console, the user's Active Directory user account must be
assigned to an Operations Manager 2007 user role. A user role is the combination of a scope of
devices that access is granted to and a profile that defines what the role can do within its defined
scope. Role-based security is enforced in the Operations console so that Operations Manager
administrators can define what any given user can see in the console and what actions the user
can take on those items. For more information, see the "Role-Based Security" section in this
document.
Management Packs
Management packs contain an application's health definition as defined by the application
developers. When imported into Operations Manager, they enable the agent to monitor the health
of an application, generate alerts when something of significance goes wrong in the application,
and take actions in the application and its supporting infrastructure to further diagnose the
application or restore it to a healthy state. Without a management pack that is specific to an
application, operating system, or device, Operations Manager 2007 is unaware of those entities
and is unable to monitor them.
Management Server
A management server is used primarily for receiving configurations and management packs from
the RMS and distributing them to the agents that report to the management server. It does not
perform any of the special functions of the RMS. A management server can be promoted to the
RMS role if the RMS fails, as long as it was present in the management group prior to the RMS
failure. Multiple management servers are installed in a management group to provide extra
capacity for agent management. In addition to providing scalability, introducing additional
management servers in a management group allows for agents to fail over and start reporting
their data to another management server if communication with their primary management server
is lost.
The management server can also be used for remote monitoring purposes (such as URL
monitoring and cross-platform monitoring). One additional role for a management server is to host
the Audit Collection Service (ACS) Collector role. The ACS Collector can be installed only on a
management server or gateway server. See the "Audit Collection Service (ACS)" section later in
this document for additional information about Audit Collection Services. Other roles include the
AEM file share role, which is also explained later in this document.
The management server makes heavy use of the CPU for data collection activities, and it also
makes heavy use of disk for UNIX and Linux data queues.
Gateway Server
Operations Manager 2007 requires that agents and management servers authenticate each other
and establish an encrypted communication channel before they exchange information. Kerberos
is the default authentication protocol. When the agent and the management server are in the
same Active Directory forest or in forests with forest trust, mutual authentication occurs
automatically. This is because Kerberos is the default authentication protocol in Active Directory.
When agents and management servers are not within the same Kerberos trust boundary (that is,
not in the same Active Directory forest or in forests with forest trust), certificate-based
authentication mechanisms must be used. In this situation, a certificate must be issued and
maintained for those agents and the management servers to which they report. In addition, if
there is a firewall between the agents and the management server, either the firewall rules must
permit each computer that hosts an agent to communicate directly through it over an encrypted
channel or the Operations Manager communication port must be opened inbound.
An Operations Manager 2007 gateway server can be used to drastically reduce the administrative
overhead required to maintain communication between agents and management servers that are
separated by a trust boundary. The gateway server acts as a proxy for agent communications.
The gateway server is placed within the trust boundary of the agents (which can be a domain),
and all the agents communicate with it. Then the gateway server, through the use of its computer
certificate, performs mutual authentication with the management server and forwards the agent-
to-management server and management server-to-agent communications along. This then
requires only one certificate for the management server and one for the gateway. In the firewall
scenario, only the gateway server and the management server need to be authorized to
communicate with each other.
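The reduction in certificate overhead can be sketched with simple arithmetic. The Python function below is an illustrative model only: its name and parameters are hypothetical, and it assumes that, without a gateway server, each untrusted agent and each management server it reports to needs one certificate, while with gateway servers only the gateways and the management servers need one each.

```python
def certificates_required(agent_count, management_server_count, gateway_count=0):
    """Count certificates to issue and maintain across a trust boundary.

    Simplified model (an assumption for illustration):
    - no gateway: one certificate per agent plus one per management server
    - with gateways: one certificate per gateway plus one per management server
    """
    if gateway_count > 0:
        return gateway_count + management_server_count
    return agent_count + management_server_count


if __name__ == "__main__":
    # Hypothetical environment: 200 untrusted agents, one management server.
    print("without gateway:", certificates_required(200, 1))
    print("with one gateway:", certificates_required(200, 1, gateway_count=1))
```

In this model the certificate count no longer grows with the number of agents once a gateway server is in place.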
Multiple gateway servers can be installed in a management group for the purposes of scalability
and failover. Should an agent lose communication with its gateway server, it can then fail over to
a different gateway server that is in the same management group and within the agent's trust
boundary.
Likewise, gateway servers can be configured to fail over between management servers in a
management group. This configuration then provides fully redundant communication channels for
agents that lie outside a management server's trust boundary.
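The failover behavior described above can be pictured as trying the primary server first and then each failover candidate in order. The following Python sketch is a conceptual model, not the agent's actual selection logic; the function name, the server names, and the reachability callback are all hypothetical.

```python
def pick_server(primary, failover_candidates, is_reachable):
    """Return the first reachable server, trying the primary first.

    is_reachable is a caller-supplied function that takes a server name
    and returns True or False. Returns None if no server is reachable.
    """
    for server in [primary, *failover_candidates]:
        if is_reachable(server):
            return server
    return None


if __name__ == "__main__":
    # Hypothetical gateway servers; gw01 is treated as unavailable.
    reachable = {"gw02", "gw03"}
    chosen = pick_server("gw01", ["gw02", "gw03"], lambda s: s in reachable)
    print(chosen)
```

The same pattern applies one level up, where a gateway server chooses among the management servers it is configured to fail over between.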
The gateway server participates in the following activities:
All data communication between untrusted agents and management servers – gateway
servers proxy communications between management servers and agents. They also serve as
a concentration point for the same communications. This data consists of configuration data
and management packs that are sent to the agent, and it consists of discovery and
monitoring data that is sent to the management server. All this data is queued on the gateway
server's local disk. Because this places a significant load on the gateway server disk, be sure
to provide ample fast disk storage.
Reporting Server
Operations Manager Reporting Server is installed into an instance of Microsoft SQL Server 2005
Reporting Services SP1 or later or Microsoft SQL Server 2008 SP1 Reporting Services. It is
responsible for building and presenting the reports from data queried from the Reporting Data
Warehouse. All reports are accessed in the Operations console, so access to reports is controlled
via role-based security.
ACS Forwarder
The ACS Forwarder is embedded in the Operations Manager 2007 agent, so no separate
deployment or configuration is required. The ACS Forwarder appears as the Audit Forwarder
service and is disabled by default. The ACS Forwarder on an individual computer or on groups of
computers is enabled via a task in the Operations console.
ACS Collector Server
The main purpose of the ACS Collector server is to collect, filter, and pre-process all the Windows
security log events for insertion into the database. Because the ACS collects all security events in
near real time, vast amounts of data enter the system from the forwarders. Not all of this
information will be of interest to your company, as defined in your company's Windows Audit
Group Policy. The filtering mechanism at the collector allows you to specify which events you
want written to the ACS database for long-term storage.
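Conceptually, the collector filter decides which events are kept for storage and which are dropped. The Python sketch below illustrates that idea only; it is not the ACS filter syntax, and the function name, the event structure, and the event IDs are placeholders invented for this example.

```python
def filter_events(events, excluded_event_ids):
    """Return only the events whose ID is not in the exclusion set."""
    return [e for e in events if e["event_id"] not in excluded_event_ids]


if __name__ == "__main__":
    # Placeholder events; the IDs are arbitrary and carry no real meaning.
    incoming = [
        {"event_id": 1001, "computer": "srv01"},
        {"event_id": 1002, "computer": "srv01"},
        {"event_id": 1003, "computer": "wks07"},
    ]
    kept = filter_events(incoming, excluded_event_ids={1002})
    print([e["event_id"] for e in kept])
```

Filtering at the collector, before insertion, is what keeps uninteresting events from consuming long-term database storage.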
The ACS Collector server has a separate installation program from the Operations Manager
servers, agents, or reporting components. It can be installed only on an existing management
server, or on the RMS if you have not installed any additional management servers. One ACS Collector
server can support hundreds to thousands of servers, depending on the server role and Windows
Audit Group Policy, and tens of thousands of workstations. However, there is a one-to-one
relationship between the ACS Collector server and the ACS Database (which is discussed in the
next section). If for scalability or control reasons your company requires additional ACS
Collectors, you will need one ACS Database per ACS Collector.
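When planning several collectors, it can help to check a proposed layout against the one-to-one rule. The following Python sketch is a planning aid under stated assumptions: the function name and the collector and database names are hypothetical, and the check only inspects a list of pairs that you supply.

```python
from collections import Counter


def one_to_one_violations(pairs):
    """Find breaches of the one-collector-to-one-database rule.

    pairs is a list of (collector_name, database_name) tuples.
    Returns a dict listing databases shared by several collectors and
    collectors mapped to several databases.
    """
    collectors = Counter(collector for collector, _ in pairs)
    databases = Counter(database for _, database in pairs)
    return {
        "shared_databases": sorted(d for d, n in databases.items() if n > 1),
        "collectors_with_many_databases": sorted(
            c for c, n in collectors.items() if n > 1
        ),
    }


if __name__ == "__main__":
    # Hypothetical plan in which two collectors point at the same database.
    plan = [("acs-col-1", "ACSDB1"), ("acs-col-2", "ACSDB1")]
    print(one_to_one_violations(plan))
```

An empty result in both lists means the plan pairs every collector with its own database.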
ACS Database
After the data has been pre-processed by the ACS Collector server, it is written to its ACS
Database, which is just a database created on a Microsoft SQL Server 2005 SP1 or SP2
instance. Because it is a standard SQL database, it can be clustered for high availability. To
accommodate the one-to-one relationship between collectors and databases, you can create, via
named instances, multiple ACS Databases on a single SQL Server 2005 server as long as it can
support the additional load. For more information about sizing and capacity planning for ACS, see
the corresponding section later in this guide.
ACS Reporting
The ACS Reporting server is also a separately installed component. A number of preconfigured
reports are available. Installation of ACS Reporting requires an existing instance of SQL
Server 2005 SP1 or later Reporting Services or Microsoft SQL Server 2008 Reporting Services.
This can be a stand-alone instance, or you can install ACS Reporting along with Operations
Manager 2007 Reporting with one tradeoff.
If you install ACS Reporting into the same Reporting Services instance as Operations
Manager 2007 Reporting, ACS Reporting is fully integrated with Operations Manager Reporting.
This results in reduced administrative overhead, because anyone who has been assigned to the
Operations Manager Reporting role will have access to the ACS reports. Some companies might
not find this to be a desirable configuration, and they might elect to install ACS Reporting into its
own instance of SQL Server Reporting. In this case, you must define your own security groups
and roles, resulting in higher administrative overhead but extremely tight control over access to
ACS data.
Proxy Agent
Operations Manager 2007 has the ability to monitor network devices (via SNMP v2), computers
that are not running a Windows operating system, and computers without agents. In these cases,
another computer that has an agent installed is actually performing the monitoring remotely. The
computer that is performing the remote monitoring is called a proxy agent. The agent that is
acting as a proxy for monitoring other devices is a standard Operations Manager agent. It is
merely configured differently by selecting the Allow this agent to act as a proxy and discover
managed objects and other computers option in the agent properties. Then you configure the
agentless managed device to designate the proxy agent it is to use. For more information about
agent deployment and management of devices, please see the Operations Manager 2007
Operations Guide.
Features
Features are present by default and only require configuration to make use of them. The ability to
configure and use features as you please in Operations Manager 2007 is a hallmark of its
flexibility.
TCP port 1270 and always originates from the management server or gateway server. In some
cases, such as when the WSMAN layer is not present on the monitored computer or it has failed,
the communication can occur over SSH on TCP port 22. SSH can be used for installing the WSMAN layer
or performing diagnostics.
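Because this traffic originates from the management server or gateway server, a quick connectivity check from that server can confirm that the two ports are reachable on a monitored computer. The Python sketch below uses only the standard socket module; the function names and the port labels are assumptions for this example, and a successful TCP connection shows only that the port accepts connections, not that monitoring works.

```python
import socket

# Ports named in this guide for UNIX and Linux monitoring.
MONITORING_PORTS = {"WSMAN": 1270, "SSH": 22}


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_monitoring_ports(host, timeout=3.0):
    """Return a dict mapping each port label to its reachability on host."""
    return {
        label: port_open(host, port, timeout)
        for label, port in MONITORING_PORTS.items()
    }
```

Run `check_monitoring_ports` from the management server or gateway server and pass the name of the monitored computer, because that is the direction in which the connection is established.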
Connector Framework
The Connector Framework is an application programming interface (API) that exposes
Operations Manager functionality for the purposes of integrating with other management products
or other technologies, such as trouble-ticketing systems. It enables the development of
connectors that can bidirectionally exchange information with Operations Manager. The
Connector Framework interacts primarily with the System Center Data Access service on the
RMS. For more information about developing applications that use the Operations Manager 2007
Connector Framework, see the Operations Manager 2007 SDK.
URL Monitoring
Operations Manager 2007 provides the ability to monitor URL availability from a watcher node
(another health service running on a different computer). Management servers that perform this
function will have a heavy load placed on their CPU resources and then disk resources. If you are
going to monitor more than 1000 URLs, you should create a dedicated management group for
this purpose.
Concepts
In planning your topology, you must understand the concept of role-based security as
implemented in Operations Manager.
selected the profile that you want to use, you then create the scope of objects that the role will
have access to. In this fashion, you can create a role that uses the Operator profile and is
scoped only to Microsoft Exchange Servers for your Exchange administrators. When you then
assign the Exchange administrators to the role (either by membership in an Active Directory
Group or by individual account), they are able to open the Operations console, but they only see
the Exchange servers and are allowed to take action only on the Exchange-related Alerts, Views,
and Tasks.
Role-based security is applied no matter how you access Operations Manager functionality,
whether it is through the Web console or the Command Shell. For more information about roles
and role-based security, see the Operations Manager 2007 Security Guide.
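The idea that a role combines a profile with a scope can be modeled in a few lines. The Python sketch below is a conceptual illustration and not the Operations Manager security model or SDK; the class, the function, and the object names are hypothetical, and only the Operator profile name comes from this guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserRole:
    """A role pairs a profile (what can be done) with a scope (on what)."""
    name: str
    profile: str
    scope: frozenset


def visible_objects(role, all_objects):
    """Return the monitored objects that fall inside the role's scope."""
    return [obj for obj in all_objects if obj in role.scope]


if __name__ == "__main__":
    # Hypothetical role for Exchange administrators, scoped to two servers.
    exchange_role = UserRole(
        name="Exchange Operators",
        profile="Operator",
        scope=frozenset({"EXCH01", "EXCH02"}),
    )
    inventory = ["EXCH01", "EXCH02", "SQL01", "WEB01"]
    print(visible_objects(exchange_role, inventory))
```

Members of such a role would see only the in-scope servers regardless of which console or shell they use.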
Business Requirements
The business owners that you need to work with are not just the top-level executives who are
sponsoring your Operations Manager project; they are the managers and directors who are
responsible for the business processes that make your company money. They might not be
particularly interested in Operations Manager as a product, but they are very interested in the
level of service that IT is providing to support their mission-critical applications.
When you are having your discussions with people in these roles, their interests will likely center
on four areas:
Ongoing service from IT
Performance information about their application
Regulatory compliance
Return on IT costs
Performance Information
When you are discussing what the business owners need to know about application performance
information, it is important to distinguish between business process performance and application
performance. Business process performance metrics are provided by business intelligence
applications, usually in the form of reports and balanced scorecards, and are not part of this
conversation. The expectations that you must understand here are those relating to application
performance. Make sure you understand and discuss these points:
What application performance information are they receiving today? What would they like to
receive? Knowing this will help you with role planning (profiles and scopes).
How are they receiving application performance information today? How would they like to
receive it? Knowing this helps you decide how to provide access to the performance
information. For example, do they need an Operations console with Read-Only Operator and
Reports access scoped to their application, or would the Web console suffice?
Regulatory Compliance
Regulatory compliance is a critical issue with business process owners now and into the future.
The business process owners look to IT to participate in the company's plans to achieve
compliance and to stay compliant. Be sure to cover these points:
Does the business process fall under regulation? If it does, what do the regulations state?
Knowing this will help with your Audit Collection Service (ACS) planning and role planning.
What sort of data is the business process owner looking to IT to provide and for what time
frames? This will help with reporting planning and data-retention planning.
Return on IT Costs
Either through direct charge backs or through an indirect overhead charge, the business owners
are paying for IT services, and like all good business owners, they want to know what they are
getting for their payments. You can use Operations Manager Reporting as a vehicle for providing
these answers, but you need to know what it is the business owner finds value in. Be sure to
cover these points:
What do they see as the most valuable services that IT provides to them? Knowing this helps
with report planning.
Are they aware of what they are getting for their IT overhead now? Knowing this, you might
choose to assemble different reports that demonstrate the services provided to the business
owner that are outside of their application.
IT Requirements
The IT requirements are going to drive the topology of Operations Manager and its supporting
infrastructure. The two main factors that will shape your IT requirements are your optimization
goals and the IT environment that Operations Manager will exist in. You will gather these
requirements from IT sponsors, key stakeholders, and consumers of Operations Manager data.
These conversations should consist of broad, open-ended questions on your part. Start by asking
how Operations Manager should be used in the environment and what the implementation should
be optimized for. Be sure to cover the following points.
Optimization Goals
Optimization goals are aspects of your Operations Manager implementation that must be met by
the design. They are exemplified by statements such as the following:
Availability/Recoverability--Operations Manager must be available with minimal outages.
Knowing this helps you with your high availability and backup/recovery planning.
Cost--Operations Manager must be implemented as economically as possible. Knowing and
operating within the budgetary constraints is critical to the success of the project.
Performance--For example, Operations Manager must report data from the environment in no
more than 1 minute and console access must occur in no more than 10 seconds after the
console is launched. Knowing this helps with the hardware planning.
Scope--Operations Manager must provide a single view of the entire environment. Knowing
this helps you with planning the number of management groups that will be needed and the
relationships between them.
Administration--Operations Manager administration must be restricted (or available) to certain
groups. Knowing this helps you plan security groups, roles, access, and potentially the
number of management groups that you will implement.
Location of Access Points--Operations Manager data must be accessible only from within the
company's intranet, or it must be available internally and externally. Knowing this helps you
plan where Operations consoles and Web consoles will be made available.
Integration--Operations Manager must integrate with the existing trouble ticketing system or
other enterprise-monitoring product. Knowing this helps you plan where Operations Manager
and its features will fit in your environment and the role it will play. It also helps you decide if
third-party connectors or connectors developed in house will be necessary.
Who normally responds to issues or alerts that are raised by automated systems or the
help desk? Knowing this will help determine who needs direct access to the Operations
console and what data the console should contain.
Does the help desk usually resolve server issues, or are issues passed to the server support
teams?
Does the company have a manned Network Operations Center or other manned monitoring
system in place? If yes, how many people and how many consoles are in continuous use?
This helps determine where management groups can be placed so that they will receive
adequate support.
How many locations other than datacenters will have agents deployed to them, and where
are they on the network?
What are the available bandwidth statistics between the sites where managed devices are
and the sites where the management servers are?
How is security logging performed currently?
How are desktop or client applications monitored currently?
How is monitoring performed for UNIX-based or Linux-based computers and network
devices?
process of distributing Operations Manager services among multiple management groups is
called partitioning.
This section addresses the general criteria that would necessitate multiple management groups.
Planning the composition of individual management groups, such as determining the sizing of
servers and distribution of Operations Manager roles among servers in a management group, is
covered in the "Management Group Composition" section.
environment. The pre-production management group is used for testing and tuning
management pack functionality before it is migrated into the production environment. In
addition, some companies employ a staging environment for servers where newly built
servers are placed for a burn-in period prior to being placed into production. The pre-
production management group can be used to monitor the staging environment to ensure the
health of servers prior to production rollout.
Dedicated ACS Functionality—If your requirements include the need to collect the Windows
Audit Security log events, you will be implementing the Audit Collection Service (ACS). It
might be beneficial to implement a management group that exclusively supports the ACS
function if your company's security requirements mandate that the ACS function be controlled
and administered by a separate administrative group other than that which administers the
rest of the production environment.
Disaster Recovery Functionality—In Operations Manager 2007, all interactions with the
OperationsManager database are recorded in transaction logs prior to being committed to the
database. Those transaction logs can be sent to another Microsoft SQL Server 2005 SP1 or
higher or Microsoft SQL Server 2008 SP1 server and committed to a copy of the
OperationsManager database there. This technique is called log shipping. The failover
location must contain the failover SQL Server that receives the shipped logs and at least one
management server that is a member of the source management group. If it is necessary to
execute a failover, you must edit the registry on the management server in the failover
location to point it to the failover SQL Server and restart the System Center Management
Service. Then promote the failover management server to the RMS role. To complete the
failover and return the management group to full functionality, change the registry on
all the remaining management servers in the management group to point to the failover SQL
Server and restart the System Center Management Service on each management server.
Increased Capacity—Operations Manager 2007 has no built-in limits regarding the number of
agents that a single management group can support. Depending on the hardware that you
use and the monitoring load (more management packs deployed means a higher monitoring
load) on the management group, you might need multiple management groups in order to
maintain acceptable performance.
Consolidated Views—When multiple management groups are used to monitor an
environment, a mechanism is needed to provide a consolidated view of the monitoring and
alerting data from them. This can be accomplished by deploying an additional management
group (which might or might not have any monitoring responsibilities) that has access to all
the data in all other management groups. These management groups are then said to be
connected. The management group that is used to provide a consolidated view of the data is
called the Local Management Group, and the others that provide data to it are called
Connected Management Groups.
Installed Languages—All servers that have an Operations Manager server role installed on
them must be installed in the same language. That is to say that you cannot install the RMS
using the English version of Operations Manager 2007 and then deploy the Operations
console using the Japanese version. If the monitoring needs to span multiple languages,
additional management groups will be needed for each language of the operators.
Security and Administrative—Partitioning management groups for security and administrative
reasons is very similar in concept to the delegation of administrative authority over Active
Directory Organizational Units or Domains to different administrative groups. Your company
might include multiple IT groups, each with its own area of responsibility. The area might be
a certain geographical area or business division. For example, in the case of a holding
company, it can be one of the subsidiary companies. Where this type of full delegation of
administrative authority from the centralized IT group exists, it might be useful to implement
management group structures in each of the areas. Then they can be configured as
Connected management groups to a Local management group that resides in the centralized
IT data center.
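The failover sequence described under Disaster Recovery Functionality in the preceding list is order-sensitive, so it can be useful to write it out as an explicit checklist. The Python sketch below only generates the ordered steps as text; it does not touch the registry or any service, and the function name and server names are hypothetical.

```python
def failover_steps(failover_sql, failover_ms, remaining_ms):
    """Return the log-shipping failover steps from this guide, in order.

    failover_sql: the SQL Server that received the shipped logs
    failover_ms:  the management server in the failover location
    remaining_ms: the other management servers in the management group
    """
    steps = [
        f"On {failover_ms}: edit the registry to point to SQL Server {failover_sql}",
        f"On {failover_ms}: restart the System Center Management service",
        f"Promote {failover_ms} to the RMS role",
    ]
    # Only after the new RMS is in place are the remaining servers repointed.
    for server in remaining_ms:
        steps.append(
            f"On {server}: edit the registry to point to SQL Server {failover_sql}"
        )
        steps.append(f"On {server}: restart the System Center Management service")
    return steps


if __name__ == "__main__":
    for number, step in enumerate(
        failover_steps("SQLDR01", "MSDR01", ["MS02", "MS03"]), start=1
    ):
        print(f"{number}. {step}")
```

Generating the list from the actual server names ahead of time gives operators a checklist to follow during a failover.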
The preceding scenarios should give you a clear picture of how many management groups you
will need in your Operations Manager infrastructure. The next section covers the distribution of
server roles within a management group and the sizing requirements for those systems.
root management
server
Administrator console | Yes | Windows XP, Windows Vista, Windows Server 2003, and Windows Server 2008 | N/A
ACS collector | Yes | Can be combined with gateway server and audit database | No
gateway server | Yes | Can be combined with ACS collector only; must be a domain member | No
Web console server | Yes | | N/A
agent | Yes | Automatically deployed to root management server and management server in a management group | N/A
Availability
The need for high availability for the databases, the RMS, management servers, and gateway
servers can be addressed by building redundancy into the management group.
Database—All databases used in Operations Manager 2007 require Microsoft SQL
Server 2005 SP1 or higher or Microsoft SQL Server 2008 SP1 or higher, which can be
installed into a MSCS quorum node failover configuration.
Note
For more information on Cluster services, refer to the Windows Server 2003 and
Windows Server 2008 online help.
RMS—The System Center Data Access service and System Center Management
Configuration service run only on the RMS, and this makes them a single point of failure in
the management group. Given the critical role that the RMS plays, if your requirements
include high availability, the RMS should also be installed into its own two-node failover
cluster. For complete details on how to cluster the RMS, see the Operations Manager 2007
Deployment Guide.
Management servers—In Operations Manager, agents in a management group can report to
any management server in that group. Therefore, having more than one management server
available provides redundant paths for agent/server communication. The best practice then is
to deploy one or two management servers in addition to the RMS and to use the Agent
Assignment and Failover Wizard to assign the agents to the management servers and to
exclude the RMS from handling agents.
Gateway servers—Gateway servers serve as a communications intermediary between
management servers and agents that lie outside the Kerberos trust boundary of the
management servers. Agents can fail over between gateway servers just as they can
between management servers if communication with their primary server is lost.
Likewise, gateway servers can be configured to fail over between management servers,
providing a fully redundant set of pathways from the agents to the management servers. See
the Operations Manager 2007 Deployment Guide for procedures on how to deploy this
configuration.
Cost
The more distributed the management group server roles are, the more resources will be needed
to support that configuration. This includes hardware, environment, licensing, operations, and
maintenance overhead. Designing with cost control as the optimization goal moves you in the
direction of a single-server implementation or minimal role distribution; this in turn reduces
redundancy and, potentially, performance.
Performance
With performance as an optimization goal, you will be better served with a more distributed
configuration and higher-end hardware. Commensurately, cost will rise.
Note
For additional information on sizing Operations Manager infrastructure, see the
Operations Manager 2007 R2 Sizing Helper at http://go.microsoft.com/fwlink/?LinkID=200081
performance
can be used to
meet the DW
storage needs.
should generally ensure that the RMS, OperationsManager database, and Data Warehouse
database are on the same local area network.
Note
To calculate the OperationsManager database size, use the Operations Manager 2007
R2 Sizing Helper Tool at http://go.microsoft.com/fwlink/?LinkID=200081
The rate of data collection—The RMS frequently communicates with the OperationsManager
database and Data Warehouse. In general, these SQL connections consume more
bandwidth and are more sensitive to network latency than connections between agents and
the RMS. Therefore, you should generally ensure that the RMS, OperationsManager
database, and Data Warehouse database are on the same local area network.
The rate of instance space changes—The instance space is the data that Operations
Manager maintains to describe all the monitored computers, services, and applications in the
management group. Updating this data in the OperationsManager database is costly relative
to writing new operational data to the database. Additionally, when instance space data
changes, the RMS makes additional queries to the OperationsManager database to compute
configuration and group changes. The rate of instance space changes increases as you
import additional management packs into your management group. Adding new agents to the
management group also temporarily increases the rate of instance space changes.
Concurrent Operations console and other SDK clients—Each open instance of the
Operations console reads data from the OperationsManager database. Querying this data
can generate large amounts of disk activity and consume significant CPU and RAM. Consoles
displaying large amounts of operational data in the Events View, State View, Alerts View, and
Performance Data View tend to put the largest load on the database. To achieve maximum
scalability, consider scoping views to include only necessary data.
Following are some best practices for sizing the OperationsManager database server:
Choose an appropriate disk subsystem—The disk subsystem for the OperationsManager
database is the most critical component for overall management group scalability and
performance. The disk volume for the database should typically be RAID 0+1 with an
appropriate number of spindles. RAID 5 is typically an inappropriate choice for this
component because it optimizes storage space at the cost of performance. Because the
primary factor in choosing a disk subsystem for the OperationsManager database is
performance rather than overall storage space, RAID 0+1 is more appropriate. When your
scalability needs do not exceed the throughput of a single drive, RAID 1 is often an
appropriate choice because it provides fault tolerance without a performance penalty.
The placement of data files and transaction logs—For lower-scale deployments, it is often
most cost-effective to combine the SQL data file and transaction logs on a single physical
volume because the amount of activity generated by the transaction log isn’t very high.
However, as the number of agents increases, you should consider placing the SQL data file
and transaction log on separate physical volumes. Because the transaction log workload
consists mostly of sequential writes, a dedicated volume lets the log perform reads and
writes more efficiently. A single two-spindle RAID 1 volume is capable of handling very high
volumes of sequential writes and should be sufficient for almost all deployments, even at a
very high scale.
Use 64-bit hardware and operating system—The OperationsManager database often benefits
from large amounts of RAM, and this can be a cost-effective way of reducing the amount of
disk activity performed on this server. Using 64-bit hardware enables you to easily increase
memory beyond 4 GB. Even if your current deployment does not require more than 4 GB of
RAM, using 64-bit hardware gives you room for growth if your requirements change in the
future.
Use a battery-backed write-caching disk controller—Testing has shown that the workload on
the OperationsManager Database benefits from write caching on disk controllers. When
configuring read caching vs. write caching on disk controllers, allocating 100 percent of the
cache to write caching is recommended. When using write-caching disk controllers with any
database system, it is important to ensure they have a proper battery backup system to
prevent data loss in the event of an outage.
Warehouse are located on the same physical machine and you want to separate data and
transaction logs, you must put the transaction logs for the OperationsManager database on a
separate physical volume from the Data Warehouse to see any benefit. The data files for the
OperationsManager database and Data Warehouse can share the same physical volume, as
long as it is appropriately sized.
Use 64-bit hardware and operating system—The Data Warehouse often benefits from large
amounts of RAM, and this can be a cost-effective way of reducing the amount of disk activity
performed on this server. Using 64-bit hardware enables you to easily increase memory
beyond 4 GB. Even if your current deployment does not require more than 4 GB of RAM,
using 64-bit hardware gives you room for growth if your requirements change in the future.
Use dedicated server hardware for the Data Warehouse—Although lower-scale deployments
can often consolidate the OperationsManager database and Data Warehouse onto the same
physical machine, it is advantageous to separate them as the number of agents increases
and, consequently, the volume of incoming operational data increases as well. You will also
see better reporting performance if the Data Warehouse and Reporting servers are
separated.
Use a battery-backed write-caching disk controller—Testing has shown that the workload on
the Data Warehouse benefits from write caching on disk controllers. When configuring read
caching versus write caching on disk controllers, allocating 100 percent of the cache to write
caching is recommended. When using write-caching disk controllers with any database
system, it is important to ensure they have a proper battery backup system to prevent data
loss in the event of an outage.
management server is a guideline based on test experience and not a hard limit. You might
find that a management server in your environment is able to support a higher or lower
number of agents.
To maximize the UNIX or Linux computer-to-management-server ratio (500:1), use dedicated
management servers for cross-platform monitoring.
Use the minimum number of management servers per management group to satisfy
redundancy requirements—The main reason for deploying multiple management servers
should be to provide for redundancy and disaster recovery rather than scalability. Based on
testing, most deployments will not need more than three to five management servers to
satisfy these needs.
When planning the rollout of collective monitoring clients, the agents should be approved in
batches of no more than 1,000 at a time to allow the agents to get synchronized with the latest
configuration.
Design Decisions
There are four fundamental design decisions to make when planning your ACS implementation.
As you make these decisions, keep in mind that there is a one-to-one relationship between the
ACS Collector server and its ACS database. An ACS database can have only one ACS Collector
feeding data to it at a time, and every ACS Collector needs its own ACS database. It is possible to
have multiple ACS Collector/Database pairs in a management group; however, there are no
procedures available out of the box for integrating the data from multiple ACS databases into a
single database.
The first decision that must be made is whether or not to deploy a management group that is
exclusively used to support ACS or to deploy ACS into a management group that also provides
health monitoring and alerting services. Here are the characteristics of these two ACS
deployment scenarios.
ACS hosted in a production management group scenario:
Scaled usage of ACS—Given that ACS collects every security event from the systems
that ACS Forwarders are enabled on, the use of ACS can generate a huge amount of
data. Unless you are using dedicated hardware for the ACS Collector and Database
roles, processing this data might negatively affect the performance of the hosting
management group, particularly in the database layer.
Separate administration and security is not required—Because ACS is hosted in a
management group, people with administrative control in the management group will
have administrative rights in ACS. If the business, regulatory/audit, and IT requirements
mandate that ACS be under nonproduction IT control, deploying ACS into a production
management group scenario is not an option.
ACS hosted on a dedicated management group scenario:
Separate administration and security is required—If there is a separate administrative
group that is responsible for audit and security controls at your company, hosting ACS on
a dedicated management group administered by the audit/security group is
recommended.
The second decision that must be made is whether or not to deploy ACS Reporting into the same
SQL Server 2005 Reporting Services instance as the Operations Manager 2007 Reporting
component. Here are the characteristics of these two scenarios.
ACS reporting integrated with Operations Manager Reporting:
Single console for all reports—When ACS Reporting is installed with Operations Manager
Reporting, the ACS reports are accessed via the Operations Manager Operations
console.
Common security model—When Operations Manager 2007 Reporting is installed into
SQL Server 2005 Reporting Services, it overwrites the default security model, replacing it
with the Operations Manager role-based security model. ACS Reporting is compatible
with this model. All users who have been assigned the Report Operator role will have
access to the ACS Reports as long as they also have the necessary permissions on the
ACS database.
Note
If Operations Manager Reporting is later uninstalled, the original SRS security model
must be restored manually using the ResetSRS.exe utility found on the installation
media in the SupportTools directory.
ACS reporting installed on a dedicated SQL Server Reporting Services instance:
Separate console for ACS and Operations Manager reports—When installed on a
dedicated SRS instance, the ACS Reports are accessed via the SRS Web site that is
created for it at installation. This provides greater flexibility in configuring the folder
structure and in using SRS Report designer.
Separate security model—A consequence of using a dedicated SRS instance is that you
can create security roles as needed to meet the business and IT requirements to control
access to the ACS reports. Note that the necessary permissions must still be granted on
the ACS database.
The third design decision that must be made is how many ACS Collector/Database pairs to
deploy to support your environment. The rate at which a single ACS Collector/Database pair can
sustain ongoing event collection and insertion is not an absolute number. This rate
depends on the performance of the storage subsystem that the database server is attached
to. For example, a low-end SAN solution can typically support 2,500 to 3,000 security events
per second. Independent of this, the ACS Collector has been observed supporting bursts of
20,000 security events per second. Following are factors that affect the number of security events
generated per second:
Audit Policy Configuration—The more aggressive the audit policy, the greater the number of
Security events that are generated from audited machines.
The role of the machine that the ACS forwarder is enabled on: Given the default Audit Policy,
a Domain Controller will generate the most security events. Member servers will generate the
next highest amount, and workstations will generate the least.
Windows Server 2003 Member Server   2 events per second
Workstation                         0.2 events per second
Using the numbers in the preceding table, a single, high-end ACS Collector/Database pair
can support up to 150 Domain Controllers, 3,000 Member Servers, or 20,000 Workstations
(with the appropriate ACS Collector filter applied).
The amount of user activity on the network—If your network is used by high-end users
conducting a large number of transactions, as is experienced, for example, at Microsoft, more
events will be generated. If your network users conduct relatively few transactions, such as
might be the case at a retail kiosk or in a warehouse scenario, you should expect fewer
security events.
The ACS Collector Filter configuration—ACS collects all security events from a monitored
machine's security event log. Out of all the events collected, you might be interested in only a
smaller subset. ACS provides the ability to filter out the undesired events, allowing only the
desired ones to be processed by the Collector and then inserted into the ACS database. As
the amount of filtering increases, fewer events will be processed and inserted into the ACS
database.
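To get a rough feel for how the per-role rates combine, the following Python sketch sums an expected event rate across a mix of machines. The member server and workstation rates are the ones given in the table earlier in this section. The function name, the role keys, and the option to pass extra measured rates (for example, for domain controllers) are illustrative assumptions and are not part of ACS.

```python
# Rough aggregate security-event rate for a mix of monitored machines.
# Member server and workstation rates come from the table in this section.
# Rates for any other role must be supplied from your own measurements.

KNOWN_RATES = {
    "member_server": 2.0,  # events per second per machine
    "workstation": 0.2,    # events per second per machine
}


def estimated_events_per_second(machine_counts, extra_rates=None):
    """Sum count * per-machine rate over all roles.

    machine_counts: dict mapping role name -> number of machines.
    extra_rates: optional dict of measured rates for roles that are
                 not listed in KNOWN_RATES (hypothetical input).
    """
    rates = dict(KNOWN_RATES)
    rates.update(extra_rates or {})
    total = 0.0
    for role, count in machine_counts.items():
        if role not in rates:
            raise KeyError(f"no event rate known for role {role!r}")
        total += count * rates[role]
    return total


if __name__ == "__main__":
    mix = {"member_server": 1000, "workstation": 5000}
    print(estimated_events_per_second(mix))
```

The result is only a starting estimate. Audit policy, user activity, and collector filters, as described above, can move the real rate substantially in either direction.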
The last design decision that must be made is the edition of SQL Server 2005 or SQL
Server 2008 to use for the ACS database. ACS supports the Standard and Enterprise
editions of both SQL Server 2005 and SQL Server 2008. The edition you choose
affects how the system behaves during the daily
database maintenance window. During the maintenance window, database partitions whose time
stamps lie outside the data retention schedule (with 14 days being a typical configuration for data
retention) are dropped from the database. If SQL Server Standard edition is used, Security event
insertion halts and events queue up on the ACS Collector until maintenance is completed. If SQL
Server Enterprise edition is used, insertion of processed Security events continues, but at only 30
percent to 40 percent of the regular rate. This is one reason why you should carefully pick the
timeframe for daily database maintenance, selecting a time when there is the least amount of
user and application activity on the network.
Important
To effectively size ACS, you must determine the number of disks required for ACS disk
I/O and you must determine the ACS database size. The processes of calculating these
values are detailed in the "Sizing ACS" section. Each ACS collector must have its own
ACS database. The rate of data insertion to the database, which is dictated by the
performance of the storage subsystem, determines the capacity of a single ACS collector.
The more disks that a single disk array can support, the better it can perform.
Tip
ACS supports the use of SQL Server 2005 Standard Edition and SQL Server 2005
Enterprise Edition; however, the edition you use affects how the system performs during
the daily database maintenance window. During the maintenance window, database
partitions with time stamps outside of the default 14-day data retention schedule are
dropped from the database. If SQL Server 2005 Standard Edition is used, Security event
insertion halts and events queue in the ACS Collector until maintenance is completed. If
SQL Server 2005 Enterprise Edition is used, Security event insertion continues, but at
only 30 to 40 percent of the regular rate. Therefore, you should carefully pick the
timeframe for daily database maintenance, selecting a time when there is the least
amount of user and application activity on the network.
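The effect of the SQL Server edition during the maintenance window can be sketched numerically. The following Python example estimates how many events are still queued on the ACS Collector when maintenance ends. The incoming rate and the window length are hypothetical inputs, and the sketch assumes that the regular insertion rate equals the incoming rate, which is a simplification.

```python
def queued_events(incoming_eps, maintenance_seconds, insert_fraction):
    """Estimate events left queued on the collector when maintenance ends.

    incoming_eps: events per second arriving during the window.
    maintenance_seconds: length of the maintenance window (assumed input).
    insert_fraction: share of the regular insertion rate still available:
        0.0 for Standard edition (insertion halts),
        0.3 to 0.4 for Enterprise edition.
    Assumes the regular insertion rate equals the incoming rate.
    """
    if not 0.0 <= insert_fraction <= 1.0:
        raise ValueError("insert_fraction must be between 0 and 1")
    return incoming_eps * maintenance_seconds * (1.0 - insert_fraction)


if __name__ == "__main__":
    # Hypothetical 10-minute window at 800 events per second.
    print(queued_events(800, 600, 0.0))   # Standard edition
    print(queued_events(800, 600, 0.3))   # Enterprise edition, low end
```

Comparing the two results for your own rate and window length gives a sense of how much backlog each edition leaves to drain after maintenance.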
Sizing ACS
The number of ACS collectors, the size of the ACS database, and the sizing of the disk
subsystem for the database are all dictated by the volume of security events forwarded
to ACS, measured in events per second. You perform ACS sizing calculations to find out
three things:
1. The number of ACS Collectors you will need
2. How much space you will need to allot for the database
3. How many disks you will need to support the expected throughput on the database
Ideally, you could determine the number of security events generated by computers in your
organization by installing a pilot ACS collector to measure the incoming event rate. If you have a
pilot ACS collector, you can monitor the ACS Collector\Incoming Event per Sec performance
monitor counter. However, if you do not have a pilot ACS collector, you can use the sizing
guidelines and script sample that follow to produce similar results.
Use the following procedure to measure the number of events per second for all computers in
your organization by using the Events Generated Per Second Script. After you determine the
number of events, you use this number to calculate the number of disks required to handle I/O
and the total ACS database size as described in the subsequent sections.
You will use the total value to calculate the number of disks required to handle I/O and
to calculate the total ACS database size in the following sections.
Calculating the number of disks required to handle I/O
During testing at Microsoft, the estimated average number of logical disk I/O operations per event
was 1.384 for the ACS database transaction logs and 0.138 for the ACS database itself. These
values may differ slightly depending on the environment. The estimates assume that the disk
revolutions per minute (RPM) has a 1:1 ratio with the logical disk I/O and that a RAID 0+1
configuration is used.
You can use the following formulas to calculate the number of disks required to handle I/O.
For the log drives:
[Average number of disk I/O per event for transaction log] * [Events per second for all
computers] / [disk RPM] * 60 sec/minute = [number of required drives] * 2 (for RAID 1)
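Values for the preceding variables are described in the following table.
Variable Value
For the database drives:
[Average number of disk I/O per event for database] * [Events per second for all
computers] / [disk RPM] * 60 sec/minute = [number of required drives] * 2 (for RAID 1)
Values for the preceding variables are described in the following table.
Variable Value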
If the number of disks required to handle I/O for events exceeds the number of disks you can
have in a disk array, you will need to divide the events into multiple collectors.
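The disk formula can be expressed as a short Python sketch. The per-event I/O constants are the values quoted from Microsoft testing above. Rounding the intermediate result up to a whole drive before doubling for RAID 1 is an assumption that matches how Appendix A treats its example, and the function name is illustrative.

```python
import math

LOG_IO_PER_EVENT = 1.384       # logical disk I/O per event, transaction log
DATABASE_IO_PER_EVENT = 0.138  # logical disk I/O per event, database


def required_drives(io_per_event, events_per_second, disk_rpm):
    """Apply the formula from the text:
    I/O per event * events per second * 60 / disk RPM,
    rounded up to whole drives, then doubled for RAID 1 mirroring.
    """
    base_drives = io_per_event * events_per_second * 60 / disk_rpm
    return math.ceil(base_drives) * 2


if __name__ == "__main__":
    eps = 800     # events per second for all computers (example input)
    rpm = 15000   # disk RPM (example input)
    print("log drives:", required_drives(LOG_IO_PER_EVENT, eps, rpm))
    print("database drives:", required_drives(DATABASE_IO_PER_EVENT, eps, rpm))
```

If either result exceeds what a single disk array can hold, that is the signal to split the forwarders across more than one collector.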
Calculating the total ACS Database size
To determine the total ACS database size, use the following formula:
[Events per second for all computers] * [0.4 KB, which is the size of an event] * 60 sec * 60
min * 24 hr / 1024 (KB to MB) / 1024 (MB to GB) / 1024 (GB to TB) * [retention period, which is
the number of days to keep in the database] = total size of database in TB
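As a minimal sketch of the database size formula, the following Python function converts a sustained event rate and a retention period into terabytes. The 0.4 KB event size comes from the formula above. The function and parameter names are illustrative assumptions.

```python
EVENT_SIZE_KB = 0.4            # average event size used in the formula
SECONDS_PER_DAY = 60 * 60 * 24


def acs_database_size_tb(events_per_second, retention_days,
                         event_size_kb=EVENT_SIZE_KB):
    """Total ACS database size in TB for a given event rate and retention."""
    kb_per_day = events_per_second * event_size_kb * SECONDS_PER_DAY
    # Three divisions by 1024 convert KB -> MB -> GB -> TB.
    tb_per_day = kb_per_day / 1024 / 1024 / 1024
    return tb_per_day * retention_days


if __name__ == "__main__":
    size_tb = acs_database_size_tb(events_per_second=800, retention_days=14)
    print(f"{size_tb:.3f} TB ({size_tb * 1024:.1f} GB)")
```

Note that feeding a peak rate into this formula overstates storage needs. An average rate over the whole day gives a closer estimate.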
The last deliverable that this guide will assist you in developing is an implementation plan.
Lab Testing
An implementation plan is simply a moderately detailed listing of the steps necessary to move the
monitoring environment from wherever it is now, referred to as the "start state," to where you want
it to be, referred to as the "desired end state." There is only one way to develop an
implementation plan properly and that is through lab testing. The goal of lab testing as part of
implementation plan development is to validate configuration and procedures, not to prove out
scalability, as it is usually cost prohibitive to fully model the production environment with all its
complexity and load in a lab setting.
Start your lab design by identifying the critical components in your production environment that
support the monitoring environment, such as Active Directory and DNS. Also identify components
Operations Manager will interact with, such as applications, servers, and workstations.
Secure hardware that will host the start state lab environment. Because you are not testing for
scale, consider using Microsoft Virtual Server to host these components as virtual machines.
Using Virtual Server has the added advantage of providing the ability to quickly reset the test
environment to a clean start state after a testing run. Build the critical components infrastructure
and other start state components in this environment. Exercise due diligence here to ensure that
the lab environment resembles the production environment as closely as possible. The closer it is
in terms of configuration, services, and data, the more valid the subsequent testing will be.
Next, get the hardware that will be used to support the production implementation of your
management groups and get it up and running in the lab setting. This gives you the opportunity to
confirm that all the hardware is present and working properly. Then compile a rough list of the
steps that will be used to perform the Operations Manager deployment. This completes the
preparatory steps.
Now you should perform the implementation in the lab, step by step, updating the procedures as
you progress. You should expect to encounter issues during this process. The goal here is to
identify as many issues that block the implementation as possible and to develop solutions or
procedures to work around the issues. You should expect to repeat this process many times,
getting a bit further each time and resetting the lab to the start state as necessary.
Once you are able to get successfully through the implementation from start state to desired end
state, you can be sure that you have a reliable and truly useful implementation plan.
Appendix A
ACS Sizing Example
This appendix is a sample walkthrough of generating a sizing estimate for a hypothetical ACS
installation. In this example we assume that the following information has been collected without
any event log filters applied:
The number of security events from a Windows Server domain controller (one of twenty domain
controllers in the environment) was sampled using the Events Generated Per Second script over
a 2-day period. The server generated an average of 900,000 events in a given 24-hour period.
Peak event generation occurred between 7:30 A.M. and 10:00 A.M. (150 minutes) when 360,000
events were recorded. Assuming all 20 domain controllers behave like the sampled server, the peak
rate is [20] * [360,000] / [150 min] / [60 sec] = 800 events per second for all servers.
The number of disks needed to support the logs was determined by using the disk RPM
(assuming 15,000 RPM), logical disk I/O, and the number of events that occurred per second
values and placing them in the following equation:
1.384 * 800 * 60 / 15000 = 4.43, rounded up to 5 drives * 2 (for RAID 1) = 10 drives
The number of disks needed to support the databases was determined by using the disk RPM
(assuming 15,000 RPM), logical disk I/O, and the number of events that occurred per second
values and placing them in the following equation:
0.138 * 800 * 60 / 15000 = 0.44, rounded up to 1 drive * 2 (for RAID 1) = 2 drives
The maximum number of disk drives that the disk array controller can support is 8 drives per
array. Therefore, you will need two collectors and two audit databases. The 20 Windows Server
domain controllers will be divided evenly among the two collectors.
The amount of storage to allocate for each database is estimated by taking the size of an average
event collected (0.4 KB), the number of events collected per second, and the duration to store
data values and placing them in the following equation:
900,000 * 20 * 0.4 KB / 1024 / 1024 = 6.87 GB of data collected per day
Assuming you want to store data for 14 days, you need 96 GB of total storage space, which is 48
GB per audit database.
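The arithmetic in this walkthrough can be checked with a few lines of Python. The sketch below reproduces the peak event rate, the daily data volume, and a first estimate of the collector count. The function names are illustrative, and estimating collectors by dividing the largest drive requirement by the array limit is an assumption that happens to agree with the example, not a rule stated in this guide.

```python
import math


def peak_events_per_second(events_in_peak, peak_minutes, computers):
    """Peak rate for all computers, assuming each behaves like the sample."""
    return computers * events_in_peak / peak_minutes / 60


def daily_data_gb(events_per_day, computers, event_size_kb=0.4):
    """Data collected per day in GB (KB -> MB -> GB)."""
    return events_per_day * computers * event_size_kb / 1024 / 1024


def collectors_needed(required_drives, max_drives_per_array):
    """First estimate: enough collectors that no array exceeds its limit."""
    return math.ceil(required_drives / max_drives_per_array)


if __name__ == "__main__":
    eps = peak_events_per_second(360_000, 150, 20)
    per_day = daily_data_gb(900_000, 20)
    print("peak events per second:", eps)
    print("GB per day:", round(per_day, 2))
    print("GB for 14 days:", round(per_day * 14))
    print("collectors:", collectors_needed(10, 8))
```

After splitting the load, it is worth rerunning the disk calculation with the per-collector event rate to confirm that each array stays within its limit.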
Events Generated Per Second Script
The Microsoft Visual Basic script shown in this section
counts and displays the number of security events generated every second in the local security
log for a computer. For best results, you should run this script locally on the computer where you
are recording security events. However, you can run the script on a remote computer when you
use the target computer name as an argument. You can generate script results by directing the
results to a .csv file. To stop the script, press CTRL+C. Afterward, you can open the .csv file in
Microsoft Excel to perform calculations on the results.
Usage
CScript /nologo SecurityEventPerSecond.vbs >>NumOfEvtsGenPerSec.csv
Or
CScript /nologo SecurityEventPerSecond.vbs <RemoteComputerName> >>NumOfEvtsGenPerSec.csv
Sample
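' *************************************************************
'
' SecurityEventPerSecond.vbs
'
' Written by: Joseph Chan (Microsoft Operations Manager Program Manager)
'
' This script takes one parameter "Computer". You can specify a
' remote computer name; if omitted, the local computer is used.
'
' This script does not stop until you stop it manually (Ctrl+C)
'
' Lines marked (reconstructed) were missing from this copy and have
' been filled in from context; review them before relying on the script.
'
' *************************************************************
Dim objArgs, computer                           ' (reconstructed)
Set objArgs = WScript.Arguments                 ' (reconstructed)
If objArgs.Count > 0 Then                       ' (reconstructed)
    computer = objArgs(0)
Else
    computer = "."
End If

Dim currentTime
currentTime = Now                               ' (reconstructed)
Do While True
    WScript.Sleep(1000)
    ' Count the events in the one-second window that has just elapsed,
    ' then advance the window by one second. (reconstructed)
    CountSecurityEvents computer, currentTime
    currentTime = DateAdd("s", 1, currentTime)
Loop

' The Sub name and signature are assumptions. (reconstructed)
Sub CountSecurityEvents(computer, currentTime)
    Dim count, dateTimeCriteria, nextSec, strCurrent, strNext
    Dim objWMIService, colLoggedEvents, objItem
    On Error Resume Next                        ' (reconstructed)
    Err.Clear
    count = 0

    ' Convert the window boundaries to WMI datetime strings. (reconstructed)
    Set dateTimeCriteria = CreateObject("WbemScripting.SWbemDateTime")
    Set nextSec = CreateObject("WbemScripting.SWbemDateTime")
    dateTimeCriteria.SetVarDate(currentTime)
    nextSec.SetVarDate(DateAdd("s", 1, currentTime))
    strCurrent = "'" & dateTimeCriteria.Value & "'"
    strNext = "'" & nextSec.Value & "'"

    ' Connect to WMI with the Security privilege needed to read the log.
    Set objWMIService = GetObject("winmgmts:" _
        & "{impersonationLevel=impersonate,(Security)}!\\" _
        & computer & "\root\cimv2")
    ' COM errors are reported as negative numbers, so test for <> 0.
    If Err.Number <> 0 Then
        WScript.Echo " Error: [" & Err.Number & "] " & Err.Description
        Exit Sub
    End If

    Set colLoggedEvents = objWMIService.ExecQuery _
        ("Select * from Win32_NTLogEvent Where Logfile ='Security' AND TimeGenerated >= " & _
        strCurrent & " AND TimeGenerated < " & strNext)
    If Err.Number <> 0 Then
        WScript.Echo " Error: [" & Err.Number & "] " & Err.Description
        Exit Sub
    End If

    For Each objItem In colLoggedEvents
        'timeGeneratedField.Value = objItem.TimeGenerated
        'WScript.Echo " " & timeGeneratedField.GetVarDate & ", " & objItem.EventCode & ", " & objItem.SourceName & ", " & objItem.User
        count = count + 1
    Next
    If Err.Number <> 0 Then
        WScript.Echo " Error: [" & Err.Number & "] " & Err.Description
        Exit Sub
    End If

    ' Write one CSV row per second: timestamp, event count. (reconstructed)
    WScript.Echo currentTime & "," & count
End Sub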
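As an alternative to opening the .csv file in Excel, the results can be summarized with a short Python sketch. It assumes that each data row ends with an integer event count (for example, a timestamp followed by a count) and skips any row that does not, such as an error message. That row layout is an assumption about the script output, and the function name is illustrative.

```python
import csv
import io


def summarize_event_counts(csv_text):
    """Return (average, peak) events per second from the script output.

    Assumes the last field of each data row is an integer event count.
    Rows whose last field is not an integer are ignored.
    """
    counts = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue
        try:
            counts.append(int(row[-1].strip()))
        except ValueError:
            continue  # not a data row (for example, an error message)
    if not counts:
        raise ValueError("no event counts found")
    return sum(counts) / len(counts), max(counts)


if __name__ == "__main__":
    sample = "9/1/2010 7:30:00 AM,10\n9/1/2010 7:30:01 AM,30\n"
    average, peak = summarize_event_counts(sample)
    print("average:", average, "peak:", peak)
```

To use it on real results, read the contents of NumOfEvtsGenPerSec.csv into a string and pass that string to the function. The average feeds the database size calculation, and the peak feeds the disk I/O calculation.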