
MS Cluster Server Troubleshooting and

Maintenance
Archived content. No warranty is made as to technical accuracy. Content may contain URLs that
were valid when originally published, but now link to sites or pages that no longer exist.
Published: May 6, 1999

By Martin Lucas, Microsoft Premier Enterprise Support

On This Page

Abstract
Introduction
Chapter 1: Preinstallation
Chapter 2: Installation Problems
Chapter 3: Post-Installation Problems
Chapter 4: Administrative Issues
Chapter 5: Troubleshooting the shared SCSI bus
Chapter 6: Client Connectivity Problems
Chapter 7: Maintenance
Appendix A: MSCS Event messages
Appendix B: Using and Reading the Cluster Logfile
Appendix C: Command-Line Administration
For More Information

Abstract

This white paper details troubleshooting and maintenance techniques for Microsoft® Cluster
Server version 1.0. Because cluster configurations vary, this document discusses techniques in
general terms. Many of these techniques can be applied to different configurations and
conditions.


Introduction

This white paper discusses troubleshooting and maintenance techniques for the first
implementation of Microsoft® Cluster Server (MSCS) version 1.0. The initial phase of the
product supports a maximum of two servers in a cluster, which are often referred to as nodes.
Since there are so many different types of resources that may be managed within a cluster, it may
be difficult at times for an administrator to determine what component or resource may be
causing failures. In many cases, MSCS can automatically detect and recover from server or
application failures. However, in some cases, it may be necessary to troubleshoot attached
resources or applications.

Clustering and Microsoft Cluster Server (MSCS)

The term clustering has been used for many years within the computing industry. Clustering is a
familiar subject to many users, but it may seem very complicated because earlier implementations
were large, complex, and sometimes difficult to configure. Earlier clusters were a challenge to
maintain without extensive training and an experienced administrator.

Microsoft has extended the capabilities of the Microsoft® Windows NT® Server operating
system through the Enterprise Edition. Microsoft® Windows NT® Server, Enterprise Edition,
contains Microsoft Cluster Server (MSCS). MSCS adds clustering capabilities to Windows NT,
to achieve high availability, easier manageability, and greater scalability.


Chapter 1: Preinstallation

MSCS Hardware Compatibility List (HCL)

The MSCS installation process emphasizes the importance of using certified hardware for
clusters. MSCS uses industry-standard hardware, which allows hardware to be easily added or
replaced as needed. Supported configurations use only hardware validated with the MSCS Cluster
Hardware Compatibility Test (HCT). These tests go beyond the standard compatibility testing for
Microsoft Windows NT and are quite intensive. Microsoft supports MSCS only when it is used on
a validated cluster configuration, and validation is available only for complete configurations as
tested together. The MSCS HCL is available on the Microsoft Web site at:
http://support.microsoft.com/kb/131900.

Configuring the Hardware


The MSCS installation process relies heavily on properly configured hardware. Therefore, it is
important that you configure and test each device before you run the MSCS installation program.
A typical cluster configuration consists of two servers, two network adapters each, local storage,
and one or more shared SCSI buses with one or more disks. While it is possible to configure a
cluster using only one network adapter in each server, you are strongly encouraged to have a
second isolated network for cluster communications. For clusters to be certified, they must have
at least one isolated network for cluster communications. The cluster may also be configured to
use the primary non-isolated network for cluster communications if the isolated network fails.
The cluster nodes must communicate with each other on a time-critical basis. Communication
between nodes is sometimes referred to as the heartbeat. Because it is important that heartbeat
packets be sent and received in a timely manner, only PCI-based network adapters should be
used, because the PCI bus has the highest priority.

Figure 1:

The shared SCSI bus consists of a compatible PCI SCSI adapter in each server, with both
systems connected to the same SCSI bus. One SCSI host adapter uses the default ID 7, and the
other uses ID 6. This ensures that the host adapters have the highest priority on the SCSI bus.
The bus is referred to as the shared SCSI bus because both systems connect to it and arbitrate for
exclusive access to one or more disk devices on the bus. MSCS controls exclusive access to each
device through the reserve and release commands in the SCSI specification.

Other storage subsystems may be available from system vendors as an alternative to SCSI,
which, in some cases, may offer additional speed or flexibility. Some of these storage types may
require installation procedures other than those specified in the Microsoft Cluster Server
Administrator's Guide. These storage types may also require special drivers or resource DLLs as
provided by the manufacturer. If the manufacturer provides installation procedures for Microsoft
Cluster Server, use those procedures instead of the generic installation directions provided in the
Administrator's Guide.

Installing the Operating System

Before you install Microsoft Windows NT Server, Enterprise Edition, you must decide what role
each computer will have in the domain. As the Administrator's Guide indicates, you may install
MSCS as a member server or as a domain controller. The following information focuses on
performance issues with each configuration:

 The member server role for each cluster node is a viable solution, but it has a few drawbacks. While this configuration does not incur overhead from performing authentication for other systems within the domain, it remains vulnerable to loss of communication with domain controllers on the network. Node-to-node communications and various registry operations within the cluster require authentication from the domain, and the need for authentication may arise at any time during normal operations. Member servers rely on domain controllers elsewhere on the network for this type of authentication. Lack of connectivity with a domain controller may severely affect performance, and may also cause one or more cluster nodes to stop responding until connection with a domain controller has been re-established. In a worst-case scenario, loss of network connectivity with domain controllers may cause complete failure of the cluster.
 The primary domain controller to backup domain controller (PDC to BDC) configuration is a better alternative than the member server option, because it removes the need for the cluster node to be authenticated by an external source. If an activity requires authentication, either of the nodes can supply it, so authentication is not a failure point as it is in the member server configuration. However, primary domain controllers may require special configuration in a multihomed environment. Additionally, the domain overhead may not be well distributed in this model, because one node may have more domain activity than the other.
 The BDC to BDC configuration is the most favorable configuration, because it provides authentication regardless of public network status, and the overhead associated with domain activities is balanced between the nodes. Additionally, BDCs are easier to configure in a multihomed environment.

Configuring Network Adapters

In a typical MSCS installation, each server in the cluster (each server is referred to as a node) has
at least two network adapters: one adapter configured as the public network for client connections,
the other for private communications between cluster nodes. This second interface is called the
cluster interconnect. If the cluster interconnect fails, MSCS (if so configured) will automatically
attempt to use the public network for communication between cluster nodes. In many two-node
installations, the private network uses a crossover cable or an isolated segment. It is important to
restrict network traffic on this interface to cluster communications only. Additionally, each
server should use PCI network adapters. If you have any ISA, PCMCIA, or other bus
architecture network adapters, these adapters may compete with the faster PCI devices in the
system for the CPU's attention. Network adapters other than PCI may cause premature failover of
cluster resources because of delays induced by the hardware. Complete systems will likely not
have these types of adapters; keep this in mind if you decide to add adapters to the configuration.

Follow standard Windows NT configuration guidelines for network adapter configuration. For
example, each network adapter must have an IP address that is on a different network or subnet.
Do not use the same IP address for both network adapters, even though they are connected to two
distinctly different physical networks. Each adapter must have a different address, and the
addresses cannot be on the same network. Consider the table of addresses in Figure 2 below.

Adapter 1 (Public Network)   Adapter 2 (Private Network)   Valid Combination?
192.168.0.1                  192.168.0.1                   NO
192.168.0.1                  192.168.0.2                   NO
192.168.0.1                  192.168.1.1                   YES
192.168.0.1                  10.0.0.1                      YES

Figure 2

In fact, because the private network is isolated, you can use nearly any valid pair of addresses on a
common subnet for this network. If you want to, you can use addresses that the Internet Assigned
Numbers Authority (IANA) designates for private use. The private use address ranges are noted
in Figure 3.

Address Class   Starting Address   Ending Address
Class A         10.0.0.0           10.255.255.255
Class B         172.16.0.0         172.31.255.255
Class C         192.168.0.0        192.168.255.255

Figure 3
The first and last addresses in a range are designated as the network and broadcast addresses for
that range. For example, in the reserved Class C range, the actual range for host addresses is
192.168.0.1 through 192.168.255.254. Use 192.168.0.1 and 192.168.0.2 to keep it simple, because
you'll have only two adapters on this isolated network. Do not declare default gateway and
WINS server addresses for this network. You may need to consult with your network
administrator on the use of these addresses, in the event that they are already in use within your
enterprise.

When you've obtained the proper addresses for network adapters in each system, use the
Network utility in Control Panel to set these options. Use the PING utility from the command
prompt to check each network adapter for connectivity with the loopback address (127.0.0.1), the
card's own IP address, and the IP address of another system. Before you attempt to install MSCS,
make sure that each adapter works properly and can communicate properly on each network.
You will find more information on network adapter configuration in the Windows NT Online
documentation, the Windows NT Server 4.0 Resource Kit, or in the Microsoft Knowledge Base.
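
For example, if the public adapter on the first node were assigned 192.168.0.1 and the matching
adapter on the second node 192.168.0.254 (both addresses here are purely illustrative), the checks
run from the first node might look like this:

    ping 127.0.0.1        (the loopback address; confirms TCP/IP is installed and bound)
    ping 192.168.0.1      (the adapter's own IP address)
    ping 192.168.0.254    (the corresponding adapter in the other node)

Repeat the same checks for the private adapters and from the second node. Each command should
return replies rather than "Request timed out" messages.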

The following are related Microsoft Knowledge Base articles regarding network adapter
configuration, TCP/IP configuration, and related troubleshooting:

164015 Understanding TCP/IP Addressing and Subnetting Basics


102908 How to Troubleshoot TCP/IP Connectivity with Windows NT
151280 TCP/IP Does Not Function After Adding a Second Adapter
174812 Effects of Using Autodetect Setting on Cluster NIC
175767 Expected Behavior of Multiple Adapters on Same Network
170771 Cluster May Fail If IP Address Used from DHCP Server
168567 Clustering Information on IP Address Failover
193890 Recommended Wins Configuration for MSCS
217199 Static Wins entries cause the Network Name to go offline.
201616 Network card detection in Microsoft Cluster Server

Configuring the Shared SCSI Bus

In a normal configuration with a single server, the server has a SCSI host adapter that connects
directly to one or more SCSI devices, and each end of the SCSI bus has a bus terminator. The
terminators help stabilize the signals on the bus and help ensure high-speed data transmission.
They also help eliminate line noise.

Configuring Host Adapters

The shared SCSI bus, as used in a Microsoft cluster, differs from most common SCSI
implementations in one way: the shared SCSI bus uses two SCSI host adapters. Each cluster
node has a separate SCSI host adapter for shared access to this bus, in addition to the other disk
controllers that the server uses for local storage (or the operating system). As with the SCSI
specification, each device on the bus must have a different ID number. Therefore, the ID for one
of these host adapters must be changed. Typically, this means that one host adapter uses the
default ID of 7, while the other adapter uses ID 6.

Note: It is important to use ID 6 and 7 for the host adapters on the shared bus so that they have
priority over other connected devices on the same channel. A cluster may have more than one
shared SCSI bus as needed for additional shared storage.

SCSI Cables

SCSI bus failures can be the result of poor-quality cables. Inexpensive cables may be
attractive because of the low price, but may not be worth the headaches associated with them. An
easy comparison between the cheaper cables and the expensive ones can be made by holding a
cable in each hand, about 10 inches from the connector, and observing the arc of the cable. The
higher quality cables don't bend very much in comparison, because they use better shielding than
the other cables, and may use different gauge wire. If you use the less expensive cables, you may
spend more supporting them than it would cost to buy the better quality cables in the first place.
This shouldn't be much of a concern for complete systems purchased from a hardware vendor.
These certified systems likely have matched cable sets. In the event you ever need to replace one
of these cables, consult with your hardware vendor.

Some configurations may use standard SCSI cables, while others may use Y cables (or adapters).
The Y cables are recommended for the shared SCSI bus. These cables allow bus termination at
each end, independent of the host adapters. Some adapters do not continue to provide bus
termination when turned off, and also cannot maintain bus termination if they are disconnected
for maintenance. Y cables avoid these points of failure and help achieve high availability.

Even with high quality cables, it is important to consider total cable length. Transfer rate, the
number of connected SCSI devices, cable quality, and termination may influence the total
allowable cable length for the SCSI bus. While it is common knowledge that a standard SCSI
bus using a 5-megabit transfer rate may have a maximum total cable length of approximately 6
meters, the maximum length decreases as the transfer rate increases. Most SCSI devices on the
market today achieve much higher transfer rates and demand a shorter total cable length. Some
manufacturers of complete systems that are certified for MSCS may use differential SCSI with a
maximum total cable length of 25 meters. Consider these implications when adding devices to an
existing bus or certified system. In some cases, it may be necessary to install another shared
SCSI bus.

SCSI Termination

Microsoft recommends active termination for each end of the shared SCSI bus. Passive
terminators may not reliably maintain adequate termination under certain conditions. Be sure to
have an active terminator at each end of the shared SCSI bus. A SCSI bus has two ends and must
have termination on each end. For best results, do not rely on automatic termination provided by
host adapters or newer SCSI devices. Avoid duplicate termination and avoid placing termination
in the middle of the bus.
Drives, Partitions, and File Systems

Whether you use individual SCSI disk drives on the shared bus, shared hardware RAID arrays,
or a combination of both, each disk or logical drive on the shared bus needs to be partitioned and
formatted before you install MSCS. The Microsoft Cluster Server Administrator's Guide covers
the necessary steps to perform this procedure. In most cases, a drive contains only one partition.
Some RAID controllers can partition arrays as multiple logical drives, or as a single large
partition. In the case of a single large partition, you will probably prefer to have a few logical
drives for your data: one drive or disk for each group of resources, with one drive designated as
the quorum disk.

If you partition drives at the operating system level into multiple partitions, remember that all
partitions on shared disks move together from one node to another. Thus, physical drives are
exclusively owned by one node at a time. In turn, all partitions on a shared disk are owned by
one node at a time. If you transfer ownership of a drive to another node through MSCS, the
partitions move in tandem, and may not be split between nodes. Any partitions on shared drives
must be formatted with the NTFS file system, and must not be members of any software-based
fault tolerant sets.

CD-ROM Drives and Tape Drives

Do not connect CD-ROM drives, tape drives, or other non-physical disk devices to the shared
SCSI bus. MSCS version 1.0 only supports non-removable physical disk drives that are listed on
the MSCS HCL. The cluster disk driver may or may not recognize other device types. If you
attach unsupported devices to the shared bus, the unsupported devices may appear usable by the
Windows NT operating system. However, because of SCSI bus arbitration between the two
systems and the use of SCSI resets, these devices may experience problems if attached to the
shared SCSI bus. These devices may also create issues for other devices on the bus. For best
results, attach the noncluster devices to a separate controller not used by the cluster.

Preinstallation Checklist

Before you install MSCS, there are several items to check to help ensure proper operation and
configuration. After proper configuration and testing, most installations of MSCS should
complete without error. The following checklist is fairly general. It may not include all possible
system options that you need to evaluate before installation:

 Use only certified hardware as listed on the MSCS Hardware Compatibility List (HCL).
 Determine which role these servers will play in the domain. Will each server be a domain controller or a member server? Recommended role: backup domain controller (BDC).
 Install Microsoft Windows NT Server, Enterprise Edition, on both servers.
 Install Service Pack 3 on each server.
 Verify cables and termination of the shared SCSI bus.
 Check drive letter assignment and NTFS formatting of shared drives with only one server turned on at a time.
 If both systems have ever been allowed to access drives on the shared bus at the same time (without MSCS installed), the drives must be repartitioned and reformatted prior to the next installation. Failure to do so may result in unexpected file system corruption.
 Ensure only physical disks or hardware RAID arrays are attached to the shared SCSI bus.
 Make sure that disks on the shared SCSI bus are not members of any software fault tolerance sets.
 Check network connectivity with the primary network adapters on each system.
 Evaluate network connectivity on any secondary network adapters that may be used for private cluster communications.
 Ensure that the system and application event logs are free of errors and warnings.
 Make sure that each server is a member of the same domain, and that you have administrative rights to each server.
 Ensure that each server has a properly sized pagefile and that the paging files reside only on local disks. Do not place pagefiles on any drives attached to the shared SCSI bus.
 Determine what name you will use for the cluster. This name will be used for administrative purposes within the cluster and must not conflict with any existing names on the network (computer, server, printer, domain, and so forth). This is not a network name for clients to attach to.
 Obtain a static IP address and subnet mask for the cluster. This address will be associated with the cluster name. You may need additional IP addresses later for groups of resources (virtual servers) within the cluster.
 Set multi-speed network adapters to a specific speed. Do not use the autodetect setting if available. For more information, see Microsoft Knowledge Base article 174812.
 Decide the name of the folder and location for cluster files to be stored on each server. The default location is %WinDir%\Cluster, where %WinDir% is your Windows NT folder.
 Determine what account the cluster service (ClusSvc) will run under. If you need to create a new account for this purpose, do so before installation. Make the domain account a member of the local Administrators group. Though the Domain Admins group may be a member of the Administrators group, this is not sufficient; the account must be a direct member of the Administrators group. Do not place any password restrictions on the account. Also ensure that the account has the Logon as a service and Lock pages in memory rights (see the example after this list).
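
These rights are normally granted through User Manager for Domains (on the Policies menu, click
User Rights, and select Show Advanced User Rights). If the Windows NT Server Resource Kit is
installed, the following is one possible way to grant them from a command prompt; this is only a
sketch, and the account name MYDOMAIN\ClusterSvc is hypothetical:

    ntrights -u MYDOMAIN\ClusterSvc +r SeServiceLogonRight
    ntrights -u MYDOMAIN\ClusterSvc +r SeLockMemoryPrivilege

Run the commands on each node, then confirm the assignments in User Manager for Domains
before you start Setup.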

Installation on systems using custom disk hardware

If your hardware uses controllers other than standard SCSI and requires special drivers and
custom resource types, use the software and installation instructions provided by the
manufacturer. The standard installation procedures for MSCS will fail on these systems, because
they require additional device drivers and DLLs supplied by the manufacturer. These systems
also require special cabling.


Chapter 2: Installation Problems

The installation process for Microsoft Cluster Server (MSCS) is very simple compared to other
network server applications, and it usually completes within just a few minutes. For a software
package that does so much, the speed with which MSCS installs might surprise you. In reality,
MSCS is more complex behind the scenes, and installation depends greatly on the compatibility
and proper configuration of the system hardware and networks. If the hardware configuration is
not acceptable, installation problems should be expected. After installation, be sure to evaluate
the proper operation of the entire cluster before you install additional software.

MSCS Installation Problems with the First Node

Is Hardware Compatible?
It is important to use certified systems for MSCS installations. Use systems and components
from the MSCS Hardware Compatibility List (HCL). For many, the main reason for installing a
cluster is to achieve high availability of their valuable resources. Why compromise availability
by using unsupported hardware? Microsoft supports only MSCS installations that use certified
complete systems from the MSCS Hardware Compatibility List. If the system fails and you need
support, uncertified hardware may limit the assistance available, and high availability may be
compromised.

Is the Shared SCSI Bus Connected and Configured Properly?

MSCS relies heavily on the shared SCSI bus. You must have at least one device on the shared
bus to store the quorum logfile and act as the cluster's quorum disk. Access to this disk is vital to
the cluster. In the event of a system failure or loss of network communication between nodes,
cluster nodes arbitrate for access to the quorum disk to determine which system will take control
and make decisions. The quorum logfile holds information regarding configuration changes made
within the cluster while another node is offline or unreachable. The installation process requires
at least one device on the shared bus for this purpose. A hardware RAID logical partition or a
separate physical disk drive is sufficient to store the quorum logfile and function as the quorum
disk.

To check proper operation of the shared SCSI bus, consult the section "Troubleshooting shared
SCSI bus" later in this document.

Install Windows NT Server, Enterprise Edition, and Service Pack 3

MSCS version 1.0 requires Microsoft Windows NT Server, Enterprise Edition, version 4.0 with
Service Pack 3 or later. If you add network adapters or other hardware devices and drivers later,
it's important to reapply the service pack to ensure that all drivers, DLLs, and system
components are of the same version. Hotfixes may require reapplication if they are overwritten.
Check with Microsoft Product Support Services or the Microsoft Knowledge Base regarding
applied hotfixes, and to determine whether the hotfix needs to be reapplied.

Does the System Disk Have Adequate Free Space to Install the Product?

MSCS requires only a few megabytes to store files on each system. The Setup program prompts
for the path to store these files. The path should be to local storage on each server, not to a drive
on the shared SCSI bus. Make sure that free space exists on the system disk, both for installation
requirements and for normal system operation.

Does the Server Have a Properly Sized System Paging File?

If you've experienced reduced system performance or near system lockup during the installation
process, check the Performance tab in the System utility of Control Panel. Make sure the
system has acceptable paging file space (the minimum required is the amount of physical RAM
plus 11 MB; for example, a server with 256 MB of RAM needs at least 267 MB of paging file
space), and that the system drive has enough free space to hold a memory dump file, should a
system crash occur. Also, make sure pagefiles are on local disks only, not on shared drives.
Performance Monitor may be a valuable resource for troubleshooting virtual memory problems.

Do Both Servers Belong to the Same Domain?

Both servers in the cluster must have membership in the same domain. Also, the service account
that the cluster service uses must be the same on both servers. Cluster nodes may be domain
controllers or domain member servers. However, if functioning as a domain member server, a
domain controller must be accessible for cluster service account authentication. This is a
requirement for any service that starts using a domain account.

Is the Primary Domain Controller (PDC) Accessible?

During the installation process, Setup must be able to communicate with the PDC. Otherwise,
the setup process will fail. Additionally, after setup, the cluster service may not start if domain
controllers are unavailable to authenticate the cluster service account. For best results, make sure
each system has connectivity with the PDC, and install each node as a backup domain controller
in the same domain.
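
A quick way to confirm this connectivity (a sketch that assumes the Windows NT Server Resource
Kit Nltest.exe utility and a hypothetical domain named MYDOMAIN) is:

    nltest /dcname:MYDOMAIN
    nltest /sc_query:MYDOMAIN

The first command should return the name of the PDC, and the second should report a successful
secure channel to a domain controller. If either command fails, resolve the domain connectivity
problem before you run MSCS Setup.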

Are You Installing While Logged On as an Administrator?

To install MSCS, you must have administrative rights on each server. For best results, log on to
the server with an administrative account before you start Setup.

Do the Drives on the Shared SCSI Bus Appear to Be Functioning Properly?

Devices on the shared SCSI bus must be turned on, configured, and functioning properly.
Consult the Microsoft Cluster Server Administrator's Guide for information on testing the drives
before setup.

Are Any Errors Listed in the Event Log?

Before you install new software of any kind, it is good practice to check the system and
application event logs for errors. This resource can indicate the state of the system before you
make configuration changes. Events may be posted to these logs in the event of installation
errors or hardware malfunctions during the installation process. Attempt to correct any problems
you find. Appendix A of this document contains information regarding some events that may be
related to MSCS and possible resolutions.

Is the Network Configured and Functioning Properly?

MSCS relies heavily on configured networks for communications between cluster nodes, and for
client access. If the networks are improperly configured or malfunctioning, the cluster software
cannot function properly. The installation process attempts to validate attached networks and needs to use them
during the process. Make sure that the network adapters and TCP/IP protocol are configured
properly with correct IP addresses. If necessary, consult with your network administrator for
proper addressing.

For best results, use statically assigned addresses and do not rely on DHCP to supply addresses
for these servers. Also, make sure you're using the correct network adapter driver. Some adapter
drivers may appear to work, because they are similar enough to the actual driver needed but are
not an exact match. For example, an OEM or integrated network adapter may use the same
chipset as a standard version of the adapter. Use of the same chipset may cause the standard
version of the driver to load instead of an OEM supplied driver. Some of these adapters work
more reliably with the driver supplied by the OEM, and may not attain acceptable performance if
using the standard driver. In some cases, this combination may prevent the adapter from
functioning at all, even though no errors appear in the system event log for the adapter.
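
One quick check is to run the following from a command prompt on each node and review the
output for every adapter (the field names shown are from Windows NT 4.0 output):

    ipconfig /all

Confirm that DHCP Enabled reads No, that the IP address and subnet mask on each adapter match
what you assigned, that the public and private adapters are on different subnets, and that the
adapter Description corresponds to the driver the manufacturer intended for that card.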

Cannot Install MSCS on the Second Node

The previous section, "MSCS Installation Problems with the First Node," contains questions you
need to ask if installation on the second node fails. Please consult this section first, before you
continue with additional troubleshooting questions in this section.

During Installation, Are You Specifying the Same Cluster Name to Join?

When you install the second node, select the Join an Existing Cluster option. The first node you
installed must be up, with its cluster service running, at the time.

Is the RPC Service Running on Both Systems?

MSCS uses remote procedure calls (RPC) and requires that the RPC service be running on both
systems. Check to make sure that the RPC service is running on both systems and that the system
event logs on each server do not have any RPC-related errors.
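
One way to confirm this from a command prompt on each node (the display name shown is the
one used by Windows NT 4.0 and may vary slightly) is:

    net start | find "Remote Procedure Call"

If the RPC service is running, it appears in the output; if nothing is listed, start the service from the
Services utility in Control Panel and check the system event log for related errors.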

Can Each Node Communicate with One Another Over Configured Networks?

Evaluate network connectivity between systems. If you used the procedures in the preinstallation
section of this document, then you've already covered the basics. During installation of the
second node, the installation program communicates through the server's primary network and
through any other networks that were configured during installation of the first node. Therefore,
you should test connectivity again with the IP addresses on these adapters. Additionally, the
cluster name and associated IP address you configured earlier will be used. Make sure the cluster
service is running on the first node and that the cluster name and cluster IP address resources are
online and available. Also, make sure that the correct network was specified for the cluster IP
address when the first node was installed. The cluster service may be registering the cluster name
on the wrong network. The cluster name resource should be registered on the network that clients
will use to connect to the cluster.

Are Both Nodes Connected to the Same Network or Subnet?


Both nodes need to use unique addresses on the same network or subnet. The cluster nodes need
to be able to communicate directly, without routers or bridges between them. If the nodes are not
directly connected to the same public network, it will not be possible to fail over IP addresses.

Cannot Reinstall MSCS After Node Evicted

If you evict a node from the cluster, it may no longer participate in cluster operations. If you
restart the evicted node and have not removed MSCS from it, the node will still attempt to join,
and cluster membership will be denied. You must remove MSCS with the Add/Remove
Programs utility in Control Panel. This action requires that you restart the system. If you ignore
the option to restart and attempt to reinstall the software anyway, you may receive an error
message.

If you receive this message, restart the affected system and reinstall the MSCS software to join
the existing cluster.


Chapter 3: Post-Installation Problems

As you troubleshoot or perform cluster maintenance, it may be possible to keep resources
available on one of the two nodes. If you can use at least one of the nodes for resources while
troubleshooting, you may be able to keep as many resources as possible available to users during
administrative activity. In some cases, it may be desirable to run with some unavailable resources
rather than none at all.

The most likely causes for one or all nodes to be down are usually related to the shared SCSI
bus. If only one node is down, check for SCSI-related problems or for communication problems
between the nodes. These are the most likely sources of problems that lead to node failures.

Entire Cluster Is Down

If the entire cluster is down, try to bring at least one node online. If you can achieve this goal, the
effect on users may be substantially reduced. When a node is online, gather event log data or
other information that may be helpful to troubleshoot the failure. Check for the existence of a
recent Memory.dmp file that may have been created from a recent crash. If necessary, contact
Microsoft Product Support Services for assistance with this file.
One Node Is Down

If a single node is unavailable, make sure that resources and groups are available on the other
node. If they are, begin troubleshooting the failed node. Try to bring it up and gather error data
from the event log or cluster diagnostic logfile.

Applying Service Packs and Hotfixes

If you're applying service packs or hotfixes, avoid applying them to both nodes at one time,
unless otherwise directed by release notes, KB articles, or other instructions. It may be possible
to apply the updates to a single node at a time to avoid rendering both nodes unavailable for a
short or long duration. More information on this topic may be found in Microsoft Knowledge
Base article 174799, "How to Install Service Packs in a Cluster."

One or More Servers Quit Responding

If one or more servers are not responding but have not crashed or otherwise failed, the problem
may be related to configuration, software, or driver issues. You can also check the shared SCSI
bus or connected disk devices.

If the servers are installed as member servers (non-domain controllers), it is possible that one or
both nodes may stop responding if connectivity with domain controllers becomes unavailable.
Both the cluster service and other applications use remote procedure calls (RPCs). Many RPC-
related operations require domain authentication. As cluster nodes must participate in domain
security, it is necessary to have reliable domain authentication available. Check network
connectivity with domain controllers and for other network problems. To avoid this potential
problem, it is preferred that the nodes be installed as backup domain controllers (BDC). The
BDC configuration allows each node to perform authentication for itself despite problems that
could exist on a wide area network (WAN).

Cluster Service Will Not Start

There are a variety of conditions that could prevent the Cluster Service (ClusSvc) from starting.
Many of these conditions may be the result of configuration or hardware related problems. The
first things to check when diagnosing this condition are the items on which the Cluster Service
depends. Many of these items are covered in Chapter 1 of this document. Common causes for
this problem, with error messages, are noted below.

Check the service account under which ClusSvc runs. This domain account needs to be a
member of the local Administrators group on each server. The account needs the Logon as a
service and Lock pages in memory rights. Make sure the account is not disabled and that
password expiration is not a factor. If the failure is because of a problem related to the service
account, the Service Control Manager (SCM) will not allow the service to load, much less run.
As a result, if you've enabled diagnostic logging for the Cluster Service, no new entries will be
written to the log, and a previous logfile may exist. Failures related to the service account may
result in Event ID 7000 or Event ID 7013 errors in the event log. In addition, you may receive
the following pop-up error message:

Could not start the Cluster Service on \\computername. Error 1069: The service did not start
because of a logon failure.

Check to make sure the quorum disk is online and that the shared SCSI bus has proper
termination and proper function. If the quorum disk is not accessible during startup, the
following popup error message may occur:

Could not start the Cluster Service on \\computername. Error 0021: The device is not ready.

Also, if diagnostic logging for the Cluster Service is enabled, the logfile entries may indicate
problems attaching to the disk. See Appendix B for more information and a detailed example of
the logfile entries for this condition, Example 1: Quorum Disk Turned Off.

If the Cluster Service is running on the other cluster node, check the cluster logfile (if it is
enabled) on that system for indications of whether or not the other node attempted to join the
cluster. If the other node did try to join the cluster, and the request was denied, the logfile
may contain details of the event. For example, if you evict a node from the cluster, but do not
remove and reinstall MSCS on that node, when the server attempts to join the cluster, the request
to join will be denied. The following are sample error messages and event messages:

Could not start the Cluster Service on \\computername. Error 5028: Size of job is %1 bytes.

Event ID 1009, Event ID 1063, Event ID 1069, Event ID 1070, Event ID 7023

For examples of logfile entries for this type of failure, see the Example 4: Evicted Node
Attempts to Join Existing Cluster section in Appendix B of this document.

If the Cluster Service won't start, check the event log for Event 7000 and 7013. These events
may indicate a problem authenticating the Cluster Service account. Make sure the password
specified for the Cluster Service account is correct. Also make sure that a domain controller is
available to authenticate the account, if the servers are non-domain controllers.
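
If the service is stopped and no pop-up message has appeared, one simple check (ClusSvc is the
service name that MSCS registers) is to try starting it manually from a command prompt and then
review the system event log:

    net start clussvc

If the command fails with a logon error, revisit the service account settings described above; if it
fails with a device-related error, verify the shared SCSI bus and the quorum disk. Enabling the
cluster diagnostic logfile described in Appendix B before retrying will also record the startup
attempt.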

Cluster Service Starts but Cluster Administrator Won't Connect

If the Services utility in Control Panel indicates that the service is running, and you cannot
connect with Cluster Administrator to administer the cluster, the problem may be related to the
Cluster Network Name or to the cluster IP address resources. There may also be RPC-related
problems. Check to make sure the RPC Service is running on both nodes. If it is, try to connect
to a known running cluster node by the computer name. This is probably the best name to use
when troubleshooting to avoid RPC timeout delays during failover of the cluster group. If
running Cluster Administrator on the local node, you may specify a period (.) in place of the
name when prompted. This will create a local connection and will not require name resolution.
If you can connect through the computer name or ".", check the cluster network name and cluster
IP address resources. Make sure that these and other resources in the cluster group are online.
These resources may fail if a duplicate name or IP address on the network conflicts with either of
these resources. A duplicate IP address on the network may cause the network adapter to shut
down. Check the system event log for errors.

Examples of logfile entries for this type of failure may be found in the Example 3: Duplicate
Cluster IP Address section in Appendix B of this document.
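
One quick way to test for a duplicate address (the address shown is hypothetical) is to take the
cluster IP address resource offline and then ping that address from another computer on the
network:

    ping 192.168.10.40

If anything replies while the resource is offline, another host is using the address, and the conflict
must be resolved before the cluster IP address resource can come online reliably.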

Group/Resource Failover Problems

A group typically fails to fail over properly because of problems with resources within the group.
For example, if you elect to move a group from one node to another, the resources within the
group are taken offline, and ownership of the group is transferred to the other node. On receiving
ownership, the node attempts to bring resources online, according to the dependencies defined
for the resources. If resources fail to go online, MSCS attempts again to bring them online. After
repeated failures, the failing resource or resources may affect the group and cause the group to
transition back to the previous node. Eventually, if failures continue, the group or affected
resources may be taken offline. You can configure the number of attempts and allowed failures
through resource and group properties.

When you experience problems with group or resource failover, evaluate which resource or
resources may be failing. Determine why the resource won't go online. Check resource
dependencies for proper configuration and make sure they are available. Also, make sure that the
"Possible Owners" list includes both nodes. The "Preferred Owners" list is designed for
automatic failback or initial group placement within the cluster. In a two-node cluster, this list
should only contain the name of the preferred node for the group, and should not contain
multiple entries.

If resource properties do not appear to be part of the problem, check the event log or cluster
logfile for details. These files may contain helpful information related to the resource or
resources in question.

Physical Disk Resource Problems

Problems with physical disk resources are usually hardware related. Cables, termination, or SCSI
host adapter configuration may cause problems with failover, or may cause premature failure of
the resource. The system event log may often show events related to physical disk or controller
problems. However, some cable or termination problems may not yield such helpful information.
It is important to verify the configuration of the shared SCSI bus and attached devices, whenever
you detect trouble with one of these devices. Marginal cable connections or cable quality can
cause intermittent failures that are difficult to troubleshoot. BIOS or firmware problems might
also be factors.

Quorum Resource Failures


If the Cluster Service won't start because of a quorum disk failure, check the corresponding
device. If necessary, use the -fixquorum startup option for the Cluster Service, to gain access to
the cluster and redesignate the quorum disk. This process may be necessary if you replace a
failed drive, or attempt to use a different device in the interim. To view or change the quorum
drive settings, right-click the cluster name at the top of the tree, listed on the left portion of the
Cluster Administrator window, and select Properties. The Cluster Properties window contains
three different tabs, one of which is for the quorum disk. From this tab, you may view or change
quorum disk settings. You may also re-designate the quorum resource. More information on this
topic may be found in Microsoft Knowledge Base article 172944, "How to Change Quorum Disk
Designation."

Failures of the quorum device while the cluster is in operation are usually related to hardware
problems, or to configuration of the shared SCSI bus. Use troubleshooting techniques to evaluate
proper operation of the shared SCSI bus and attached devices.

File Share Won't Go Online

For a file share to reach online status, the dependent resources must exist and be online. The path
for the share must exist. Permissions on the file share directory must also include at least Read
access for the Cluster Service account.
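
If the permissions are in doubt, the built-in Cacls.exe utility can grant the account Read access
from a command prompt; in this sketch, the path Y:\Public and the account MYDOMAIN\ClusterSvc
are hypothetical:

    cacls Y:\Public /E /G MYDOMAIN\ClusterSvc:R

The /E switch edits the existing permissions rather than replacing them, and /G grants the listed
access. Verify the result by running cacls Y:\Public with no other switches.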

Problems Accessing Drive

If you attempt to access a shared drive through the drive letter, you may receive an "Incorrect
Function" error. The error may be a result of the drive not being online on the node from which
you're accessing it; the drive may be owned by, and online on, the other cluster node.
Check Cluster Administrator for ownership of the resource and online status. If necessary,
consult the Physical Disk Resource Problems section of this document. The error could also
indicate drive or controller problems.


Chapter 4: Administrative Issues

Cannot Connect to Cluster Through Cluster Administrator

If you try to administer the cluster from a remote workstation, the most common way to do so
would be to use the network name you defined during the setup process as Cluster Name. This
resource is located in the Cluster Group. Cluster Administrator needs to establish a connection
using RPC. If the RPC service has failed on the cluster node that owns the Cluster Group, it will
not be possible to connect through the Cluster Name or through the name of that computer. Try to
connect, instead, using the computer names of each cluster node. If this works, it indicates a
problem with either the IP address or Network Name resources in the Cluster Group. There may
also be a name resolution problem on the network that may prevent access through the Cluster
Name.

Failure to connect using the Cluster Name or computer names of either node may indicate
problems with the server, with RPC connectivity, or with security. Make sure that you are logged
on with an administrative account in the domain, and that the account has access to administer
the cluster. Access may be granted to additional accounts by using Cluster Administrator on one
of the cluster nodes. For more information on controlling administrative access to the cluster, see
"Specifying Which Users Can Administer a Cluster" in the MSCS Administrator's Guide.

If Cluster Administrator cannot connect from the local console of one of the cluster nodes, check
to see if the Cluster Service is started. Check the system event log for errors. You may want to
enable diagnostic logging for the Cluster Service. If the problem occurs after recently starting the
system, wait 30 to 60 seconds for the Cluster Service to start, and then try to run Cluster
Administrator again.

Cluster Administrator Loses Connection or Stops Responding on Failover

The Cluster Administrator application uses RPC communications to connect with the cluster. If
you use the Cluster Name to establish the connection, Cluster Administrator may appear to stop
responding during a failover of the Cluster Group and its resources. This normal delay occurs
during the registration of the IP address and network name resources within the group, and the
establishment of a new RPC connection. If a problem occurs with the registration of these
resources, the process may take extended time until these resources become available. The first
RPC connection must timeout before the application attempts to establish another connection. As
a result, Cluster Administrator may eventually time out if there are problems bringing the IP
address or network name resources online within the Cluster Group. In this situation, try to
connect using the computer name of one of the cluster nodes, instead of the cluster name. This
usually allows a more real-time display of resource and group transitions without delay.

Cannot Move a Group

To move a group from one node to another, you must have administrative rights to run Cluster
Administrator. The destination node must be online and the cluster service started. The state of
the node must be online and not Paused. In a paused state, the node is a fully active member in
the cluster, but cannot own or run groups.

Both cluster nodes should be listed in the Possible Owners list for the resources within the group;
otherwise the group may only be owned by a single node and will not fail over. While in some
configurations, this restriction may be intentional, in most it would be a mistake as it would
prevent the entire group from failing over. Also, to move a group, resources within the group
cannot be in a pending state. To initiate a Move Group request, resources must be in one of the
following three states: online, offline, or failed.
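
When Cluster Administrator is unavailable, the Cluster.exe utility described in Appendix C can
perform the same check and move from a command prompt. This is a sketch; the group name
Disk Group 1 is illustrative, and the switches supported by your build are listed by cluster /?:

    cluster group "Disk Group 1" /status
    cluster group "Disk Group 1" /move

The first command reports the current owner and state of the group; the second requests a move to
the other node, subject to the Possible Owners and resource state restrictions described above.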

Cannot Delete a Group

To properly delete a group from the cluster, the group must not contain resources. You may
either delete the resources contained within the group, or move them to another group in the
cluster.

Problems Adding, Deleting, or Moving Resources

Adding Resources

Resources are usually easy to add. However, it is important to understand the various resource
types and their requirements. Some resource types have prerequisites for other resources that must
exist within the same group; as you work with MSCS, you will become more familiar with these
requirements. You may find that a resource depends on one or more resources within
the same group. Examples might include IP addresses, network names, or physical disks. The
resource wizard will typically indicate mandatory requirements for other resources. However, in
some cases, it may be a good idea to add related resources to the dependency list even when they
are not required. While Cluster.exe also allows the addition of resources and groups, the
command-line utility does not impose the dependency or resource property constraints that
Cluster Administrator does, because these activities may consist of multiple commands.

For example, suppose you want to create a network name resource in a new group. If you try to
create the network name resource first, the wizard will indicate that it depends on an IP address
resource. The wizard lists available resources in the group from which you select. If this is a new
group, the list may be empty. Therefore, you will need to create the required IP address resource
before you create the network name.

If you create another resource in the group and make it dependent on the network name resource,
the resource will not go online without the network name resource in an online state. A good
example might be a File Share resource. Thus, the share will not be brought online until the
network name is online. Because the network name resource depends on an IP address resource,
it would be repetitive to make the share also dependent on the same IP address. The established
dependency with the network name implies a dependency on the address. You can think of this
as a cascading dependency.

You might ask, "What about the disk where the data will be? Shouldn't the share depend on the
existence or online status of the disk?" Yes, you should create a dependency on the physical disk
resource, although this dependency is not required. If the resource wizard did impose this
requirement, it would imply that the only data source that could be used for a file share would be
a physical disk resource on the shared SCSI bus. For volatile data, shared storage is the way to
go, and a dependency should be created for it. This way, if the disk experiences a momentary
failure, the share will be taken offline and restored when the disk becomes available. However,
because a dependency on a physical disk resource is not required, the administrator retains the
flexibility to use other disk storage for holding data. Use of non-physical disk data
storage for the share implies that for it to be moved to the other node, equivalent storage and the
same drive letter with the same information must also be available there. Further, there must be
some method of data replication or mirroring for this type of storage, if the data is volatile. Some
third parties may have solutions for this situation. Use of local storage in this manner is not
recommended for read/write shares. For read-only information, the two data sources can remain
in sync, and problems with out-of-sync data are avoided.

If you use a shared drive for data storage, make sure to establish the dependency with the share
and with any other resources that depend on it. Failure to do so may cause erratic or undesired
behavior of resources that depend on the disk resource. Some applications or services that rely on
the disk may terminate as a result of not having the dependency.

If you use Cluster.exe to create the same resources, note that it is possible to create a network
name resource without the required IP address resource. However, the network name will not go
online, and will generate errors from such an attempt.

Using the Generic Application/Service Resources for Third-Party Applications

While some third-party services may require modification for use within a cluster, many services
may function normally while controlled by the generic service resource type as provided with
MSCS. If you have a program that runs as an application on the server's desktop that you want to
be highly available, you may be able to use the generic application resource type to control this
application within the cluster.

The parameters for each of these generic resource types are similar. However, when planning to
have MSCS manage these resources, it is necessary to first be familiar with the software and
with the resources that software requires. For example, the software might create a share of some
kind for clients to access data. Most applications need access to their installation directory to
access DLL or INI files, to access stored data, or, perhaps, to create temporary files. In some
cases, it may be wise to install the software on a shared drive in the cluster, so that the software
and necessary components may be available to either node, if the group that contains the service
moves to another cluster node.

Consider a service called SomeService. Assume this is a third-party service that does something
useful. The service requires that the share, SS_SHARE, must exist, and that it maps to a
directory called DATA beneath the installation directory. The startup mode for the service is set
for AUTOMATIC, so that the service will start automatically after the system starts. Normally,
the service would be installed to C:\SomeService, and it stores dynamic configuration details in
the following registry key:

HKEY_LOCAL_MACHINE\Software\SomeCompany\SomeService

If you wanted to configure MSCS to manage this service and make it available through the
cluster, you would probably take the following actions:

1. Create a group using Cluster Administrator. You might call it SomeGroup to remain consistent with the software naming convention.
2. Make sure the group has a physical disk resource to store the data and the software, an IP address resource, and a network name resource. For the network name, you might use something like SomeServer, for clients to access the share that will be in the group.
3. Install the software on the shared drive (drive Y, for example).
4. Using Cluster Administrator, create a File Share resource in the group named SS_SHARE. Make the file share resource dependent on the physical disk and network name. If either of these resources fails or goes offline, you want the share to follow the state of that resource. Set the path to the Data directory on the shared drive. According to what you know about the software, this should be Y:\SomeService\Data.
5. Set the startup mode for the service to MANUAL. Because MSCS will be controlling the service, the service does not need to start itself before MSCS has a chance to start and bring the physical disk and other resources online.
6. Create a generic service resource in the group. The name for the resource should describe what it corresponds to; you might want to call it SomeService, to match the service name. Allow both cluster nodes as possible owners. Make the resource dependent on the physical disk resource and network name. Specify the service name and any necessary service parameters. Click to select the Use network name for computer name option. This causes the application's API call requesting the computer name to return the network name in the group. Specify that the registry key should be replicated by adding the following line under the Registry Replication tab: Software\SomeCompany\SomeService.
7. Bring all the resources in the group online and test the service.
8. If the service works correctly, stop the service by taking the generic service resource offline.
9. Move the group to the other node.
10. Install the service on the other node using the same parameters and installation directory on the shared drive.
11. Make sure to set the startup mode to MANUAL using the Services utility in Control Panel.
12. Bring all the required resources and the generic service resource online, and test the service.

Note: If you evict a node from the cluster at any time, and have to completely reinstall a cluster
node from the beginning, you will likely need to repeat steps 10 through 12 on the node if you
add it back to the cluster. The procedure described here is generic in nature, and may be
adaptable to various applications. If you are uncertain how to configure a service in the cluster,
contact the application software vendor for more information.
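
If you prefer to script part of this configuration, step 6 can be approximated with Cluster.exe once the group, disk, IP address, and network name resources exist. The commands below are a sketch only: the resource names (SomeGroup, Disk Y:, SomeServer, SomeService) follow the example above, the generic service private property name ServiceName is an assumption that should be confirmed with the /priv option, and the registry replication and Use network name for computer name settings are still configured on the resource property sheets in Cluster Administrator.

REM Create the generic service resource in the existing group
cluster resource "SomeService" /create /group:"SomeGroup" /type:"Generic Service"

REM Tell the resource which installed service it controls
cluster resource "SomeService" /priv ServiceName=SomeService

REM Make the resource dependent on the shared disk and the network name
cluster resource "SomeService" /adddep:"Disk Y:"
cluster resource "SomeService" /adddep:"SomeServer"

REM Bring the resource online and check its state
cluster resource "SomeService" /online
cluster resource "SomeService" /status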

Applications follow a similar procedure, except that you must substitute the generic application
resource type for the generic service resource type used in the above procedure. If you have a
simple application that is already installed on both systems, then you may adapt the following
steps to the procedure previously described:

1. Create a generic application resource in a group. For this example, we will make Notepad.exe a highly available application.
2. For the command line, specify c:\WinNT\System32\Notepad.exe (or a different path, depending on your Windows NT installation directory). The path must be the same on both cluster nodes. Be sure to specify the working directory as needed, and click to select the Allow application to interact with the desktop option so that Notepad.exe is not put in the background.
3. Skip the Registry Replication tab, because Notepad.exe does not have registry keys requiring replication.
4. Bring the resource online and notice that it appears on the desktop. Choose Move Group, and the application should appear on the other node's desktop. (A command-line sketch of this test follows.)

Some cluster-aware applications may not require this type of setup and they may have setup
wizards to create necessary cluster resources.

Deleting Resources

Some resources may be difficult to delete if any cluster nodes are offline. For example, you may be able to delete an IP address resource while only one cluster node is online. However, if you try to delete a physical disk resource under the same conditions, an error message dialog box appears indicating that the operation cannot be completed.

Physical disk resources affect the disk configuration on each node in the cluster and must be dealt with accordingly on each system at the same time. Therefore, all cluster nodes must be online to remove this type of resource from the cluster.

If you attempt to remove a resource on which other resources depend, a dialog box listing the
related resources will be displayed. These resources will also be deleted, as they are linked by
dependency to the individual resource chosen for removal. To avoid removal of these resources,
first change or remove the configured dependencies.
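
When all nodes are online, a resource can also be removed from the command line. The sketch below deletes a hypothetical IP address resource named Temp Address; take the resource offline first, and remember that any resources that depend on it will be deleted along with it unless you remove the dependencies beforehand.

REM Take the resource offline before removing it
cluster resource "Temp Address" /offline

REM Delete the resource from the cluster configuration
cluster resource "Temp Address" /delete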

Moving Resources from One Group to Another


To move resources from one group to another, both groups must be owned by the same cluster node. Attempts to move resources between groups with different owners result in a pop-up error message stating that the groups must have the same owner. This situation is easily corrected by moving one of the groups so that both groups have the same owner. Equally important, the resources to be moved may have dependent resources: if a dependency exists between the resource to be moved and another resource, a prompt may appear that lists the related resources that must move along with it.

Problems moving resources between groups, other than those mentioned in this section, may be caused by system problems or configuration-related issues. Check event logs or cluster logfiles for more information that may relate to the resource in question.
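
For example, if GroupA is owned by NODE1 and GroupB is owned by NODE2, a sketch of the fix is to move one of the groups so that a single node owns both, and then move the resource between the groups in Cluster Administrator as usual. The group and node names here are placeholders.

REM Move GroupB to the node that already owns GroupA
cluster group "GroupB" /moveto:NODE1

REM Verify that both groups now report the same owner
cluster group "GroupA" /status
cluster group "GroupB" /status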

Chkdsk and Autochk

Disks attached to the shared SCSI bus interact differently with Chkdsk and the companion
system startup version of the same program, Autochk. Autochk does not perform Chkdsk
operations on shared drives when the system starts up, even if the operations are needed. MSCS
performs a file system integrity check for each drive, when bringing a physical disk online.
MSCS automatically launches Chkdsk, as necessary.

If you need to run Chkdsk on a drive, consult the following Microsoft Knowledge Base articles:

174617 Chkdsk Runs while Running Microsoft Cluster Server Setup
176970 Chkdsk /f Does Not Run on the Shared Cluster Disk
174797 How to Run CHKDSK on a Shared Drive

Chapter 5: Troubleshooting the shared SCSI bus

Verifying Configuration

For the shared SCSI bus to work correctly, the SCSI host adapters must be configured correctly.
As with SCSI specifications, each device on the bus must have a unique ID number. For proper
operation, ensure that the host adapters are each set to a unique ID. For best results, set one
adapter to ID 6 and the other adapter to ID 7 to ensure that the host adapters have adequate
priority on the bus. Also, make sure that both adapters have the same firmware revision level.
Because the shared SCSI bus is not used for booting the operating system, disable the BIOS on
the adapter, unless otherwise directed by the hardware vendor.

Make sure that you connect only physical disk or hardware RAID devices to the shared bus.
Devices other than these, such as tape drives, CD-ROM drives, or removable media devices,
should not be used on the shared bus. You may use them on another bus for local storage.

Cables and termination are vital parts of the SCSI bus configuration, and should not be
compromised. Cables need to be of high quality and within SCSI specifications. The total cable
length on the shared SCSI bus needs to be within specifications. Cables supplied with complete
certified systems should be correct for use with the shared SCSI bus. Check for bent pins on
SCSI cable connectors and devices, and ensure that each cable is attached firmly.

Correct termination is also important. Terminate the bus at both ends, and use active terminators.
Use of SCSI Y cables may allow disconnection of one of the nodes from the shared bus without
losing termination. If you have terminators attached to each end of the bus, make sure that the
controllers are not trying to also terminate the bus.

Make sure that all devices connected to the bus are rated for the type of controllers used. For
example, do not attach differential SCSI devices to a standard SCSI controller. Verify that the
controllers can each identify every disk device attached to the bus. Make sure that the
configuration of each disk device is correct. Some newer smart devices can automatically
terminate the bus or negotiate for SCSI IDs. If the controllers do not support this, configure the
drives manually. A mixture of smart devices with others that require manual configuration can
lead to problems in some configurations. For best results, configure the devices manually.

Also, make sure that the SCSI controllers on the shared bus are configured correctly and with the same parameters (other than SCSI ID). Differences in data transfer rate or other parameters between the two controllers may cause unpredictable behavior.

Adding Devices to the Shared SCSI Bus

To add disk devices to the shared SCSI bus, you must properly shut down all equipment and both
cluster nodes. This is necessary because the SCSI bus may be disconnected while adding the
device or devices. Attempting to add devices while the cluster and devices are in use may induce
failures or other serious problems that may not be recoverable. Add the new device or devices in
the same way you add a device to a standard SCSI bus. This means you must choose a unique
SCSI ID for the new device, and ensure that the device configuration is correct for the bus and
termination scheme. Verify cable and termination before applying power. Turn on one cluster
node, and use Disk Administrator to assign a drive letter and format each new device. Before
turning on the other node, create a physical disk resource using Cluster Administrator. After you
create the physical disk resource and verify that the resource will go online successfully, turn on
the other cluster node and allow it to join the cluster. Allowing both nodes to be online without
first creating a disk resource for the new device can lead to file system corruption, as both nodes
may have different interpretations of disk structure.
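
Before powering on the second node, you may also want to confirm the new resource from the command line. The drive letter below (Disk Z:) is a placeholder for the resource you just created.

REM Bring the new physical disk resource online and confirm its state
cluster resource "Disk Z:" /online
cluster resource "Disk Z:" /status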

Verifying Cables and Termination

A good procedure for verification of cable and termination integrity is to first use the SCSI host
adapter utilities to determine whether the adapter can identify all disk devices on the bus.
Perform this check with only one node turned on. Then, turn off the computer and perform the
same check on the other system. If this initial check succeeds, the next step is to check drive
identification from the operating system level, with only one of the nodes turned on. If MSCS is
already installed, then the cluster service will need to be started for shared drives to go online.
Check to make sure the shared drives go online. If the device fails, there may be a problem with
the device, or perhaps a cable or termination problem.


Chapter 6: Client Connectivity Problems

Clients Have Intermittent Connectivity Based on Group Ownership

If clients successfully connect to clustered resources only when a specific node is the owner, a
few possible problems could lead to this condition. Check the system event log on each server
for possible errors. Check to make sure that the group has at least one IP address resource and
one network name resource, and that clients use one of these to access the resource or resources
within the group. If clients connect with any other network name or IP address, they may not be
accessing the correct server in the event that ownership of the resources changes. As a result of
improper addressing, access to these resources may appear limited to a particular node.

If you are able to confirm that clients use proper addressing for the resource or resources, check
the IP address and network name resources to see that they are online. Check network
connectivity with the server that owns the resources. For example, try some of the following
techniques:

From the server:

PING server's primary adapter IP address (on client network)
PING other server's primary adapter IP address (on client network)
PING IP address of the group
PING Network Name of the group
PING Router/Gateway between client and server (if any)
PING Client IP address

If the above tests work correctly up to the router/gateway check, the problem may be elsewhere
on the network because you have connectivity with the other server and local addresses. If tests
complete up to the client IP address test, there may be a client configuration or routing problem.

From the client:

PING Client IP address
PING Router/Gateway between client and server (if any)
PING server's primary adapter IP address (on client network)
PING other server's primary adapter IP address (on client network)
PING IP address of the group
PING Network Name of the group

If the tests from the server all pass, but you experience failures performing tests from the client,
there may be client configuration problems. If all tests complete except the test using the network
name of the group, there may be a name resolution problem. This may be related to client
configuration, or it may be a problem with the client's designated WINS server. These problems
may require network administrator intervention.
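
As a concrete illustration, the sequence from the client might look like the following. The addresses and the name MYSERVER are placeholders that must be replaced with the values from your own configuration.

REM Client's own address
ping 172.16.0.50

REM Router/gateway between client and cluster (if any)
ping 172.16.0.1

REM Primary (client network) adapter on each cluster node
ping 172.16.0.11
ping 172.16.0.12

REM IP address resource in the group
ping 172.16.0.20

REM Network name resource in the group (also exercises name resolution)
ping MYSERVER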

Clients Do Not Have Any Connectivity with the Cluster

If clients lose connectivity with both cluster nodes, check to make sure that the Cluster Service is
running on each node. Check the system event log for possible errors. Check network
connectivity between cluster nodes, and with other network devices, by using the procedure in
the previous section. If the Cluster Service is running, and there are no apparent connectivity
problems between the two servers, there is likely a network or client configuration problem that
does not directly involve the cluster. Check to make sure the client uses the TCP/IP protocol, and
has a valid IP address on the network. Also, make sure that the client is using the correct network
name or IP address to access the cluster.
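
A quick way to confirm that the Cluster Service is running on a node is from a command prompt. The service's short name is ClusSvc; the display name should contain the word Cluster, although the exact display name may vary on your system.

REM List running services and look for the Cluster Service
net start | find /i "cluster"

REM Attempt to start the service if it is not running
net start clussvc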

Clients Have Problems Accessing Data Through a File Share

If clients experience problems accessing cluster file shares, first check the resource and make
sure it is online, and that any dependent resources (disks, network names, and so on) are online.
Check the system event log for possible errors. Next, check network connectivity between the
client and the server that owns the resource. If the data for the share is on a shared drive (using a
physical disk resource), make sure that the file share resource has a dependency declared for the
physical disk resource. You can reset the file share by toggling the file share resource offline and
back online again. Cluster file shares behave essentially the same as standard file shares. So,
make sure that clients have appropriate access at both the file system level and the share level.
Also, make sure that the server has enough client access licenses loaded for the connecting clients; a client may be unable to connect if no licensed connections are available.
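
For a quick check, you can exercise the share from a client and reset it from a cluster node at a command prompt. SomeServer and SS_SHARE follow the earlier file share example and are placeholders for the group's network name and the file share resource in your configuration.

REM From a client: map the share through the group's network name
net use X: \\SomeServer\SS_SHARE

REM From a cluster node: reset the share by cycling the file share resource
cluster resource "SS_SHARE" /offline
cluster resource "SS_SHARE" /online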

Clients Cannot Access Cluster Resources Immediately After IP Address Change


If you create a new IP address resource or change the IP address of an existing resource, it is
possible that clients may experience some delay if you use WINS for name resolution on the
network. This problem may occur because of delays in replication between WINS servers on the
network. Such delays cannot be controlled by MSCS, and must be allowed sufficient time to
replicate. If you suspect there is a WINS-database problem, consult your network administrator,
or contact Microsoft Product Support Services for TCP/IP support.

Clients Experience Intermittent Access

Network adapter configuration is one possible cause of intermittent access to the cluster, and of
premature failover. Some autosense settings for network speed can spontaneously redetect
network speed. During the detection, network traffic through the adapter may be compromised.
For best results, set the network speed manually to avoid the recalibration. Also, make sure to
use the correct network adapter drivers. Some adapters may require special drivers, although
they may be detected as a similar device.


Chapter 7: Maintenance

Most maintenance operations within a cluster may be performed with one or more nodes online,
and usually without taking the entire cluster offline. This ability allows higher availability of
cluster resources.

Installing Service Packs

Microsoft Windows NT service packs may normally be installed on one node at a time and tested before you move resources back to that node. This is one advantage of having a cluster: if something goes wrong during the update on one node, the other node remains untouched and continues to make resources available. Because there may be exceptions to whether a service pack can be applied to one node at a time, consult the release notes for the service pack for special instructions when installing on a cluster.

Service Packs and Interoperability Issues

To avoid potential issues or compatibility problems with other applications, check the Microsoft
Knowledge Base for articles that may apply. For example, the following articles discuss
installation steps or interoperability issues with Windows NT Option Pack, Microsoft SQL
Server, and Windows NT Service Pack 4:

218922 Installing NTOP on Cluster Server with SP4
223258 How to install NTOP on MSCS 1.0 with SQL
223259 How to install FTP from NTOP on Microsoft Cluster Server 1.0
191138 How to install Windows NT Option Pack on Cluster Server

Replacing Adapters

Adapter replacement may usually be performed after moving resources and groups to the other
node. If replacing a network adapter, ensure the new adapter configuration for TCP/IP exactly
matches that of the old adapter. If replacing a SCSI adapter and using Y cables with external
termination, it may be possible to disconnect the SCSI adapter without affecting the remaining
cluster node. Check with your hardware vendor for proper replacement techniques if you want to
attempt replacement without shutting down the entire cluster. This may be possible in some
configurations.

Shared Disk Subsystem Replacement

With most clusters, shared disk subsystem replacement may result in the need to shut down the
cluster. Check with your manufacturer and with Microsoft Product Support Services for proper
procedures. Some replacements may not require much intervention, while others may require
adjustments to configuration. Further information on this topic is available in the Microsoft
Cluster Server Administrator's Guide and in the Microsoft Knowledge Base.

Emergency Repair Disk

The emergency repair disk (updated with Rdisk.exe) contains vital information about a particular
system that you can use to help recover a system that will not start, allowing you to restore a
backup, if necessary. It is recommended that the disk be updated when the system configuration
experiences changes. It is important to note that the cluster configuration is not stored on the
emergency repair disk. The service and driver information for the Cluster Service is stored in the
system registry. However, cluster resource and group configuration is stored in a separate
registry hive and may be restored from a recent system backup. NTBACKUP will back up this
hive when backing up registry files (if selected). Other backup software may or may not include
the cluster hive. The file associated with the cluster hive is CLUSDB and is stored with the other
cluster files (usually in c:\winnt\cluster). Be sure to check system backups to ensure this hive is
included.
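
To refresh the repair information on each node, you might run Rdisk with the /s switch, which also saves the security and account databases; this still does not capture the cluster hive, so make sure CLUSDB is included in your regular backups.

REM Update the repair information and create or update the emergency repair disk
rdisk /s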

System Backups and Recovery

The configuration for cluster resources and groups is stored in the cluster registry hive. This
registry hive may be backed up and restored with NTBackup. Some third-party backup software
may not include this registry hive when backing up system registry files. It is important, if you
rely on a third-party backup solution, that you verify your ability to back up and restore this hive.
The registry file for the cluster hive may be found in the directory where the cluster software was
installed — not on the quorum disk.

As most backup software (at the time of this writing) is not cluster-aware, it may be important to
establish a network path to shared data for use in system backups. For example, if you use a local
path to the data (example: G:\), and if the node loses ownership of the drive, the backup
operation may fail because it cannot reach the data using the local device path. However, if you
create a cluster-available share to the disk structure, and map a drive letter to it, the connection
may be re-established if ownership of the actual disk changes. Although the ultimate solution
would be a fully cluster-aware backup utility, this technique may be a better alternative until
such a utility is available.
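
A sketch of this approach, using the SS_SHARE example from earlier and NTBACKUP's command-line form, appears below. The share name, drive letter, and NTBACKUP switches shown (/d for a description, /v for verification) are assumptions that should be checked against your own configuration and backup software documentation.

REM Map a drive letter to the cluster file share instead of using the local device path
net use X: \\SomeServer\SS_SHARE

REM Back up through the mapped drive so the path remains valid if disk ownership changes
ntbackup backup X:\ /d "Cluster shared data" /v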

What Not to Do on a Cluster Server

Below is a list of things not to do with a cluster. While more items could cause problems, these are definite words of warning. Article numbers for related Microsoft Knowledge Base articles are noted where applicable.

• Do not create software fault tolerant sets with shared disks as members. (171052)
• Do not add resources to the cluster group. (168948)
• Do not install MSCS when both nodes have been online and connected to the shared storage at the same time without MSCS installed and running on at least one node first.
• Do not change the computer name of either node.
• Do not use WINS static entries for cluster nodes or cluster addresses. (217199)
• Do not configure WINS or default gateway addresses for the private interconnect. (193890)
• Do not attempt to configure cluster resources to use unsupported network protocols or related network services (IPX, NetBEUI, DLC, AppleTalk, Services for Macintosh, and so on). Microsoft Cluster Server works only with the TCP/IP protocol.
• Do not delete the HKEY_LOCAL_MACHINE\System\Disk registry key while the cluster is running, or if you are using local software fault tolerance.


Appendix A: MSCS Event messages

Event ID 1000

Source: ClusSvc
Description: Microsoft Cluster Server suffered an unexpected fatal error at line ### of source module %path%. The error code was 1006.
Problem: Messages similar to this may occur in the event of a fatal error that may cause the Cluster Service to terminate on the node that experienced the error.
Solution: Check the system event log and the cluster diagnostic logfile for additional information. It is possible that the cluster service may restart itself after the error. This event message may indicate serious problems that may be related to hardware or other causes.

Event ID 1002

Source: ClusSvc
Description: Microsoft Cluster Server handled an unexpected error at line 528 of source module G:\Nt\Private\Cluster\Resmon\Rmapi.c. The error code was 5007.
Problem: Messages similar to this may occur after installation of Microsoft Cluster Server. If the cluster service starts and successfully forms or joins the cluster, they may be ignored. Otherwise, these errors may indicate a corrupt quorum logfile or other problem.
Solution: Ignore the error if the cluster appears to be working properly. Otherwise, you may want to try creating a new quorum logfile using the -noquorumlogging or -fixquorum parameters as documented in the Microsoft Cluster Server Administrator's Guide.

Event ID 1006

Source: ClusSvc
Description: Microsoft Cluster Server was halted because of a cluster membership or communications error. The error code was 4.
Problem: An error may have occurred between communicating cluster nodes that affected cluster membership. This error may occur if nodes lose the ability to communicate with each other.
Solution: Check network adapters and connections between nodes. Check the system event log for errors. There may be a network problem preventing reliable communication between cluster nodes.

Event ID 1007

Source: ClusSvc
Description: A new node, "ComputerName", has been added to the cluster.
Information: The Microsoft Cluster Server Setup program ran on an adjacent computer. The setup process completed, and the node was admitted for cluster membership. No action required.

Event ID 1009

Source: ClusSvc
Description: Microsoft Cluster Server could not join an existing cluster and could not form a new cluster. Microsoft Cluster Server has terminated.
Problem: The cluster service started and attempted to join a cluster. The node may not be a member of an existing cluster because of eviction by an administrator. After a cluster node has been evicted from the cluster, the cluster software must be removed and reinstalled if you want it to rejoin the cluster. And, because a cluster already exists with the same cluster name, the node could not form a new cluster with the same name.
Solution: Remove MSCS from the affected node, and reinstall MSCS on that system if desired.

Event ID 1010

Source: ClusSvc
Description: Microsoft Cluster Server is shutting down because the current node is not a member of any cluster. Microsoft Cluster Server must be reinstalled to make this node a member of a cluster.
Problem: The cluster service attempted to run but found that it is not a member of an existing cluster. This may be due to eviction by an administrator or incomplete attempt to join a cluster. This error indicates a need to remove and reinstall the cluster software.
Solution: Remove MSCS from the affected node, and reinstall MSCS on that server if desired.

Event ID 1011

Source: ClusSvc
Description: Cluster Node "ComputerName" has been evicted from the cluster.
Information: A cluster administrator evicted the specified node from the cluster.

Event ID 1012

Source: ClusSvc
Description: Microsoft Cluster Server did not start because the current version of Windows NT is not correct. Microsoft Cluster Server runs only on Windows NT Server, Enterprise Edition.
Information: The cluster node must be running the Enterprise Edition version of Windows NT Server, and must have Service Pack 3 or later installed. This error may occur if you force an upgrade using the installation disks, which effectively removes any service packs installed.

Event ID 1015

Source: ClusSvc
Description: No checkpoint record was found in the logfile W:\Mscs\Quolog.log; the checkpoint file is invalid or was deleted.
Problem: The Cluster Service experienced difficulty reading data from the quorum logfile. The logfile could be corrupted.
Solution: If the Cluster Service fails to start because of this problem, try manually starting the cluster service with the -noquorumlogging parameter. If you need to adjust the quorum disk designation, use the -fixquorum startup parameter when starting the cluster service. Both of these parameters are covered in the MSCS Administrator's Guide.

Event ID 1016

Source: ClusSvc
Description: Microsoft Cluster Server failed to obtain a checkpoint from the cluster database for log file W:\Mscs\Quolog.log.
Problem: The cluster service experienced difficulty establishing a checkpoint for the quorum logfile. The logfile could be corrupt, or there may be a disk problem.
Solution: You may need to use procedures to recover from a corrupt quorum logfile. You may also need to run chkdsk on the volume to ensure against file system corruption.

Event ID 1019

Source: ClusSvc
Description: The log file D:\MSCS\Quolog.log was found to be corrupt. An attempt will be made to reset it, or you should use the Cluster Administrator utility to adjust the maximum size.
Problem: The quorum logfile for the cluster was found to be corrupt. The system will attempt to resolve the problem.
Solution: The system will attempt to resolve this problem. This error may also be an indication that the cluster property for maximum size should be increased through the Quorum tab. You can manually resolve this problem by using the -noquorumlogging parameter.

Event ID 1021

Source: ClusSvc
Description: There is insufficient disk space remaining on the quorum device. Please free up some space on the quorum device. If there is no space on the disk for the quorum log files then changes to the cluster registry will be prevented.
Problem: Available disk space is low on the quorum disk and must be resolved.
Solution: Remove data or unnecessary files from the quorum disk so that sufficient free space exists for the cluster to operate. If necessary, designate another disk with adequate free space as the quorum device.

Event ID 1022

Source: ClusSvc
Description: There is insufficient space left on the quorum device. The Microsoft Cluster Server cannot start.
Problem: Available disk space is low on the quorum disk and is preventing the startup of the cluster service.
Solution: Remove data or unnecessary files from the quorum disk so that sufficient free space exists for the cluster to operate. If necessary, use the -fixquorum startup option to start one node. Bring the quorum resource online and adjust free space or designate another disk with adequate free space as the quorum device.

Event ID 1023

Source: ClusSvc
Description: The quorum resource was not found. The Microsoft Cluster Server has terminated.
Problem: The device designated as the quorum resource could not be found. This could be due to the device having failed at the hardware level, or that the disk resource corresponding to the quorum drive letter does not match or no longer exists.
Solution: Use the -fixquorum startup option for the cluster service. Investigate and resolve the problem with the quorum disk. If necessary, designate another disk as the quorum device and restart the cluster service before starting other nodes.

Event ID 1024

Source: ClusSvc
Description: The registry checkpoint for cluster resource "resourcename" could not be restored to registry key registrykeyname. The resource may not function correctly. Make sure that no other processes have open handles to registry keys in this registry subkey.
Problem: The registry key checkpoint imposed by the cluster service failed because an application or process has an open handle to the registry key or subkey.
Solution: Close any applications that may have an open handle to the registry key so that it may be replicated as configured with the resource properties. If necessary, contact the application vendor about this problem.

Event ID 1034

Source: ClusSvc
Description: The disk associated with cluster disk resource resource name could not be found. The expected signature of the disk was signature. If the disk was removed from the cluster, the resource should be deleted. If the disk was replaced, the resource must be deleted and created again to bring the disk online. If the disk has not been removed or replaced, it may be inaccessible at this time because it is reserved by another cluster node.
Problem: The cluster service attempted to mount a physical disk resource in the cluster. The cluster disk driver could not locate a disk with this signature. The disk may be offline or may have failed. This error may also occur if the drive has been replaced or reformatted. This error may also occur if another system continues to hold a reservation for the disk.
Solution: Determine why the disk is offline or non-operational. Check cables, termination, and power for the device. If the drive has failed, replace the drive and restore the resource to the same group as the old drive. Remove the old resource. Restore data from a backup and adjust resource dependencies within the group to point to the new disk resource.

Event ID 1035

Source: ClusSvc
Description: Cluster disk resource %1 could not be mounted.
Problem: The cluster service attempted to mount a disk resource in the cluster and could not complete the operation. This could be due to a file system problem, hardware issue, or drive letter conflict.
Solution: Check for drive letter conflicts, evidence of file system issues in the system event log, and for hardware problems.

Event ID 1036

Source: ClusSvc
Description: Cluster disk resource "resourcename" did not respond to a SCSI inquiry command.
Problem: The disk did not respond to the issued SCSI command. This usually indicates a hardware problem.
Solution: Check SCSI bus configuration. Check the configuration of SCSI adapters and devices. This may indicate a misconfigured or a failing device.

Event ID 1037

Source: ClusSvc
Description: Cluster disk resource %1 has failed a filesystem check. Please check your disk configuration.
Problem: The cluster service attempted to mount a disk resource in the cluster. A filesystem check was necessary and failed during the process.
Solution: Check cables, termination, and device configuration. If the drive has failed, replace the drive and restore data. This may also indicate a need to reformat the partition and restore data from a current backup.

Event ID 1038

Source: ClusSvc
Description: Reservation of cluster disk "Disk W:" has been lost. Please check your system and disk configuration.
Problem: The cluster service had exclusive use of the disk, and lost the reservation of the device on the shared SCSI bus.
Solution: The disk may have gone offline or failed. Another node may have taken control of the disk or a SCSI bus reset command was issued on the bus that caused a loss of reservation.

Event ID 1040

Source: ClusSvc
Description: Cluster generic service "ServiceName" could not be found.
Problem: The cluster service attempted to bring the specified generic service resource online. The service could not be located and could not be managed by the Cluster Service.
Solution: Remove the generic service resource if this service is no longer installed. The parameters for the resource may be invalid. Check the generic service resource properties and confirm correct configuration.

Event ID 1041

Source: ClusSvc
Description: Cluster generic service "ServiceName" could not be started.
Problem: The cluster service attempted to bring the specified generic service resource online. The service could not be started at the operating system level.
Solution: Remove the generic service resource if this service is no longer installed. The parameters for the resource may be invalid. Check the generic service resource properties and confirm correct configuration. Check to make sure the service account has not expired, that it has the correct password, and has necessary rights for the service to start. Check the system event log for any related errors.

Event ID 1042

Source: ClusSvc
Description: Cluster generic service "resourcename" failed.
Problem: The service associated with the mentioned generic service resource failed.
Solution: Check the generic service properties and service configuration for errors. Check system and application event logs for errors.

Event ID 1043

Source: ClusSvc
Description: The NetBIOS interface for "IP Address" resource has failed.
Problem: The network adapter for the specified IP address resource has experienced a failure. As a result, the IP address is either offline, or the group has moved to a surviving node in the cluster.
Solution: Check the network adapter and network connection for problems. Resolve the network-related problem.

Event ID 1044

Source: ClusSvc
Description: Cluster IP Address resource %1 could not create the required NetBios interface.
Problem: The cluster service attempted to initialize an IP Address resource and could not establish a context with NetBios.
Solution: This could be a network adapter or network adapter driver related issue. Make sure the adapter is using a current driver and the correct driver for the adapter. If this is an embedded adapter, check with the OEM to determine if a specific OEM version of the driver is a requirement. If you already have many IP Address resources defined, make sure you have not reached the NetBios limit of 64 addresses. If you have IP Address resources defined that do not have a need for NetBios affiliation, use the IP Address private property to disable NetBios for the address. This option is available in SP4 and helps to conserve NetBios address slots.

Event ID 1045

Source: ClusSvc
Description: Cluster IP address "IP address" could not create the required TCP/IP Interface.
Problem: The cluster service tried to bring an IP address online. The resource properties may specify an invalid network or malfunctioning adapter. This error may occur if you replace a network adapter with a different model and continue to use the old or inappropriate driver. As a result, the IP address resource cannot be bound to the specified network.
Solution: Resolve the network adapter problem or change the properties of the IP address resource to reflect the proper network for the resource.

Event ID 1046

Source: ClusSvc
Description: Cluster IP Address resource %1 cannot be brought online because the subnet mask parameter is invalid. Please check your network configuration.
Problem: The cluster service tried to bring an IP address resource online but could not do so. The subnet mask for the resource is either blank or otherwise invalid.
Solution: Correct the subnet mask for the resource.

Event ID 1047

Source: ClusSvc
Description: Cluster IP Address resource %1 cannot be brought online because the IP address parameter is invalid. Please check your network configuration.
Problem: The cluster service tried to bring an IP address resource online but could not do so. The IP address property contains an invalid value. This may be caused by incorrectly creating the resource through an API or the command line interface.
Solution: Correct the IP address properties for the resource.

Event ID 1048

Source: ClusSvc
Description: Cluster IP address, "IP address," cannot be brought online because the specified adapter name is invalid.
Problem: The cluster service tried to bring an IP address online. The resource properties may specify an invalid network or a malfunctioning adapter. This error may occur if you replace a network adapter with a different model. As a result, the IP address resource cannot be bound to the specified network.
Solution: Resolve the network adapter problem or change the properties of the IP address resource to reflect the proper network for the resource.

Event ID 1049

Source: ClusSvc
Description: Cluster IP address "IP address" cannot be brought online because the address IP address is already present on the network. Please check your network configuration.
Problem: The cluster service tried to bring an IP address online. The address is already in use on the network and cannot be registered. Therefore, the resource cannot be brought online.
Solution: Resolve the IP address conflict, or choose another address for the resource.

Event ID 1050

Source: ClusSvc
Description: Cluster Network Name resource %1 cannot be brought online because the name %2 is already present on the network. Please check your network configuration.
Problem: The cluster service tried to bring a Network Name resource online. The name is already in use on the network and cannot be registered. Therefore, the resource cannot be brought online.
Solution: Resolve the conflict, or choose another network name.

Event ID 1051

Source: ClusSvc
Description: Cluster Network Name resource "resourcename" cannot be brought online because it does not depend on an IP address resource. Please add an IP address dependency.
Problem: The cluster service attempted to bring the network name resource online, and found that a required dependency was missing.
Solution: Microsoft Cluster Server requires an IP address dependency for network name resource types. Cluster Administrator presents a pop-up message if you attempt to remove this dependency without specifying another like dependency. To resolve this error, replace the IP address dependency for this resource. Because it is difficult to remove this dependency, Event 1051 may be an indication of problems within the cluster registry. Check other resources for possible dependency problems.

Event ID 1052

Source: ClusSvc
Description: Cluster Network Name resource "resourcename" cannot be brought online because the name could not be added to the system.
Problem: The cluster service attempted to bring the network name resource online but the attempt failed.
Solution: Check the system event log for errors. Check network adapter configuration and operation. Check TCP/IP configuration and name resolution methods. Check WINS servers for possible database problems or invalid static mappings.

Event ID 1053

Source: ClusSvc
Description: Cluster File Share "resourcename" cannot be brought online because the share could not be created.
Problem: The cluster service attempted to bring the share online but the attempt to create the share failed.
Solution: Make sure the Server service is started and functioning properly. Check the path for the share. Check ownership and permissions on the directory. Check the system event log for details. Also, if diagnostic logging is enabled, check the log for an entry related to this failure. Use the net helpmsg errornumber command with the error code found in the log entry.

Event ID 1054

Source: ClusSvc
Description: Cluster File Share %1 could not be found.
Problem: The share corresponding to the named File Share resource was deleted using a mechanism other than Cluster Administrator. This may occur if you select the share with Explorer and choose 'Not Shared'.
Solution: Delete shares or take them offline via Cluster Administrator or the command line program CLUSTER.EXE.

Event ID 1055

Source: ClusSvc
Description: Cluster File Share "sharename" has failed a status check.
Problem: The cluster service (through resource monitors) periodically monitors the status of cluster resources. In this case, a file share failed a status check. This could mean that someone attempted to delete the share through Windows NT Explorer or Server Manager, instead of through Cluster Administrator. This event could also indicate a problem with the Server service, or access to the shared directory.
Solution: Check the system event log for errors. Check the cluster diagnostic log (if it is enabled) for status codes that may be related to this event. Check the resource properties for proper configuration. Also, make sure the file share has proper dependencies defined for related resources.

Event ID 1056

Source: ClusSvc
Description: The cluster database on the local node is in an invalid state. Please start another node before starting this node.
Problem: The cluster database on the local node may be in a default state from the installation process and the node has not properly joined with an existing node.
Solution: Make sure another node of the same cluster is online first before starting this node. Upon joining with another cluster node, the node will receive an updated copy of the official cluster database, which should alleviate this error.

Event ID 1057

Source: ClusSvc
Description: The cluster service CLUSDB could not be opened.
Problem: The Cluster Service tried to open the CLUSDB registry hive and could not do so. As a result, the cluster service cannot be brought online.
Solution: Check the cluster installation directory for the existence of a file called CLUSDB. Make sure the registry file is not held open by any applications, and that permissions on the file allow the cluster service access to this file and directory.

Event ID 1058

Source: ClusSvc
Description: The Cluster Resource Monitor could not load the DLL %1 for resource type %2.
Problem: The Cluster Service tried to bring a resource online that requires a specific resource DLL for the resource type. The DLL is either missing, corrupt, or an incompatible version. As a result, the resource cannot be brought online.
Solution: Check the cluster installation directory for the existence of the named resource DLL. Make sure the DLL exists in the proper directory on both nodes.

Event ID 1059

Source: ClusSvc
Description: The Cluster Resource DLL %1 for resource type %2 failed to initialize.
Problem: The Cluster Service tried to load the named resource DLL and it failed to initialize. The DLL could be corrupt, or an incompatible version. As a result, the resource cannot be brought online.
Solution: Check the cluster installation directory for the existence of the named resource DLL. Make sure the DLL exists in the proper directory on both nodes and is of proper version. If the DLL is clusres.dll, this is the default resource DLL that comes with MSCS. Check to make sure the version/date stamp is equivalent to or with a later date than the version contained in the service pack in use.

Event ID 1061

Source: ClusSvc
Description: Microsoft Cluster Server successfully formed a cluster on this node.
Information: This informational message indicates that an existing cluster of the same name was not detected on the network, and that this node elected to form the cluster and own access to the quorum disk.

Event ID 1062

Source: ClusSvc
Description: Microsoft Cluster Server successfully joined the cluster.
Information: When the Cluster Service started, it detected an existing cluster on the network and was able to successfully join the cluster. No action needed.

Event ID 1063

Source: ClusSvc
Description: Microsoft Cluster Server was successfully stopped.
Information: The Cluster Service was stopped manually by the administrator.

Event ID 1064

Source: ClusSvc
Description: The quorum resource was changed. The old quorum resource could not be marked as obsolete. If there is a partition in time, you may lose changes to your database, because the node that is down will not be able to get to the new quorum resource.
Problem: The administrator changed the quorum disk designation without all cluster nodes present.
Solution: When other cluster nodes attempt to join the existing cluster, they may not be able to connect to the quorum disk, and may not participate in the cluster, because their configuration indicates a different quorum device. For any nodes that meet this criterion, you may need to use the -fixquorum option to start the Cluster Service on these nodes and make configuration changes.

Event ID 1065

Source: ClusSvc
Description: Cluster resource %1 failed to come online.
Problem: The cluster service attempted to bring the resource online, but the resource could not reach an online status. The resource may have exhausted the timeout period allotted for the resource to reach an online state.
Solution: Check any parameters related to the resource and check the event log for details.

Event ID 1066

Source: ClusSvc
Description: Cluster disk resource resourcename is corrupted. Running Chkdsk /F to repair problems.
Problem: The Cluster Service detected corruption on the indicated disk resource and started Chkdsk /f on the volume to repair the structure. The Cluster Service will automatically perform this operation, but only for cluster-defined disk resources (not local disks).
Solution: Scan the event log for additional errors. The disk corruption could be indicative of other problems. Check related hardware and devices on the shared bus and ensure proper cables and termination. This error may be a symptom of failing hardware or a deteriorating drive.

Event ID 1067

Source: ClusSvc
Description: Cluster disk resource %1 has corrupt files. Running Chkdsk /F to repair problems.
Problem: The Cluster Service detected corruption on the indicated disk resource and started Chkdsk /f on the volume to repair the structure. The Cluster Service will automatically perform this operation, but only for cluster-defined disk resources (not local disks).
Solution: Scan the event log for additional errors. The disk corruption could be indicative of other problems. Check related hardware and devices on the shared bus and ensure proper cables and termination. This error may be a symptom of failing hardware or a deteriorating drive.

Event ID 1068

Source: ClusSvc
Description: The cluster file share resource resourcename failed to start. Error 5.
Problem: The file share cannot be brought online. The problem may be caused by permissions to the directory or disk in which the directory resides. This may also be related to permission problems within the domain.
Solution: Check to make sure that the Cluster Service account has rights to the directory to be shared. Make sure a domain controller is accessible on the network. Make sure dependencies for the share and for other resources in the group are set correctly. Error 5 translates to "Access Denied."

Event ID 1069

Source: ClusSvc
Description: Cluster resource "Disk G:" failed.
Problem: The named resource failed and the cluster service logged the event. In this example, a disk resource failed.
Solution: For disk resources, check the device for proper operation. Check cables, termination, and logfiles on both cluster nodes. For other resources, check resource properties for proper configuration, and check to make sure dependencies are configured correctly. Check the diagnostic log (if it is enabled) for status codes corresponding to the failure.

Event ID 1070

Source: ClusSvc
Description: Cluster node attempted to join the cluster but failed with error 5052.
Problem: The cluster node attempted to join an existing cluster but was unable to complete the process. This problem may occur if the node was previously evicted from the cluster.
Solution: If the node was previously evicted from the cluster, you must remove and reinstall MSCS on the affected server.

Event ID 1071

Source: ClusSvc
Description: Cluster node 2 attempted to join but was refused. Error 5052.
Problem: Another node attempted to join the cluster and this node refused the request.
Solution: If the node was previously evicted from the cluster, you must remove and reinstall MSCS on the affected server. Look in Cluster Administrator to see if the other node is listed as a possible cluster member.

Event ID 1073

Source: ClusSvc
Description: Microsoft Cluster Server was halted to prevent an inconsistency within the cluster. The error code was 5028.
Problem: The cluster service on the affected node was halted because of some kind of inconsistency between cluster nodes.
Solution: Check connectivity between systems. This error may be an indication of configuration or hardware problems.

Event ID 1077

Source: ClusSvc
Description: The TCP/IP interface for cluster IP address resourcename has failed.
Problem: The IP address resource depends on the proper operation of a specific network interface as configured in the resource properties. The network interface failed.
Solution: Check the system event log for errors. Check the network adapter for proper operation and replace the adapter if necessary. Check to make sure the proper adapter driver is loaded for the device and check for newer versions of the driver.

Event ID 1080

Source: ClusSvc
Description: The Microsoft Cluster Server could not write file W:\MSCS\Chk7f5.tmp. The disk may be low on disk space, or some other serious condition exists.
Problem: The cluster service attempted to create a temporary file in the MSCS directory on the quorum disk. Lack of disk space or other factors prevented successful completion of the operation.
Solution: Check the quorum drive for available disk space. The file system may be corrupted or the device may be failing. Check file system permissions to ensure that the cluster service account has full access to the drive and directory.

Event ID 1093

Source: ClusSvc
Description: Node %1 is not a member of cluster %2. If the name of the node has changed, Microsoft Cluster Server must be reinstalled.
Problem: The cluster service attempted to start but found that it was not a valid member of the cluster.
Solution: Microsoft Cluster Server may need to be reinstalled on this node. If this is the result of a server name change, be sure to evict the node from the cluster (from an operational node) prior to reinstallation.

Event ID 1096

Source: ClusSvc
Description: Microsoft Cluster Server cannot use network adapter %1 because it does not have a valid IP address assigned to it.
Problem: The network configuration for the adapter has changed and the cluster service cannot make use of the adapter for the network that was assigned to it.
Solution: Check the network configuration. If a DHCP address was used for the primary address of the adapter, the address may have been lost. For best results, use a static address.

Event ID 1097

Source: ClusSvc
Description: Microsoft Cluster Server did not find any network adapters with valid IP addresses installed in the system. The node will not be able to join a cluster.
Problem: The network configuration for the system needs to be corrected to match the same connected networks as the other node of the cluster.
Solution: Check the network configuration and make sure it agrees with the working node of the cluster. Make sure the same networks are accessible from all systems in the cluster.

Event ID 1098

Source: ClusSvc
Description: The node is no longer attached to cluster network network_id by adapter adapter. Microsoft Cluster Server will delete network interface interface from the cluster configuration.
Information: The Cluster Service observed a change in network configuration that might be induced by a change of adapter type or by removal of a network. The network will be removed from the list of available networks.

Event ID 1100

Source: ClusSvc
Description: Microsoft Cluster Server discovered that the node is now attached to cluster network network_id by adapter adapter. A new cluster network interface will be added to the cluster configuration.
Information: The Cluster Service noticed a new network accessible by the cluster nodes, and has added the new network to the list of accessible networks.

Event ID 1102

Source: ClusSvc
Description: Microsoft Cluster Server discovered that the node is attached to a new network by adapter adapter. A new network and network interface will be added to the cluster configuration.
Information: The cluster service noticed the addition of a new network. The network will be added to the list of available networks.

Event ID 1104

Source: ClusSvc
Description: Microsoft Cluster Server failed to update the configuration for one of the node's network interfaces. The error code was errorcode.
Problem: The cluster service attempted to update a cluster node and could not perform the operation.
Solution: Use the net helpmsg errorcode command to find an explanation of the underlying error. For example, error 1393 indicates that a corrupted disk caused the operation to fail.

Event ID 1105

Source: ClusSvc
Description: Microsoft Cluster Server failed to initialize the RPC services. The error code was %1.
Problem: The cluster service attempted to utilize required RPC services and could not successfully perform the operation.
Solution: Use the net helpmsg errorcode command to find an explanation of the underlying error. Check the system event log for other RPC related errors or performance problems.

Event ID 1107

Source: ClusSvc
Description: Cluster node node name failed to make a connection to the node over network network name. The error code was 1715.
Problem: The cluster service attempted to connect to another cluster node over a specific network and could not establish a connection. This error is a warning message.
Solution: Check to make sure that the specified network is available and functioning correctly. If the node experiences this problem, it may try other available networks to establish the desired connection.

Event ID 1109

Source: ClusSvc
Description: The node was unable to secure its connection to cluster node %1. The error code was %2. Check that both nodes can communicate with their domain controllers.
Problem: The cluster service attempted to connect to another cluster node and could not establish a secure connection. This could indicate domain connectivity problems.
Solution: Check to make sure that the networks are available and functioning correctly. This may be a symptom of larger network problems or domain security issues.

Event ID 1115

Source: ClusSvc
Description: An unrecoverable error caused the join of node nodename to the cluster to be aborted. The error code was errorcode.
Problem: A node attempted to join the cluster but was unable to obtain successful membership.
Solution: Use the NET HELPMSG errorcode command to obtain further description of the error that prevented the join operation. For example, error code 1393 indicates that a disk structure is corrupted and nonreadable. An error code like this could indicate a corrupted quorum disk.

Related Event Messages

Event ID 9

Source: Disk
Description: The device, \Device\ScsiPort2, did not respond within the timeout period.
An I/O request was sent to a SCSI device and was not serviced within acceptable
Problem:
time. The device timeout was logged by this event.
You may have a device or controller problem. Check SCSI cables, termination, and
adapter configuration. Excessive recurrence of this event message may indicate a
Solution:
serious problem that could indicate potential for data loss or corruption. If necessary,
contact your hardware vendor for help troubleshooting this problem.

Event ID 101

Source: W3SVC
Description: The server was unable to add the virtual root "/" for the directory "path" because of the following error: The system cannot find the path specified. The data is the error.
Problem: The World Wide Web Publishing service could not create a virtual root for the IIS Virtual Root resource. The directory path may have been deleted.
Solution: Re-create or restore the directory and contents. Check the resource properties for the IIS Virtual Root resource and ensure that the path is correct. This problem may occur if you had an IIS Virtual Root resource defined and then uninstalled Microsoft Cluster Server without first deleting the resource. In this case, you may evaluate and change virtual root properties by using the Internet Service Manager.

Event ID 1004

Source: DHCP
Description: DHCP IP address lease "IP address" for the card with network address "media access control address" has been denied.
Problem: This system uses a DHCP-assigned IP address for a network adapter. The system attempted to renew the leased address and the DHCP server denied the request. The address may already be allocated to another system. The DHCP server may also have a problem. Network connectivity may be affected by this problem.
Solution: Resolve the problem by correcting DHCP server problems or assigning a static IP address. For best results within a cluster, use statically assigned IP addresses.

Event ID 1005

Source: DHCP
Description: DHCP failed to renew a lease for the card with network address "MAC Address." The following error occurred: The semaphore timeout period has expired.
Problem: This system uses a DHCP-assigned IP address for a network adapter. The system attempted to renew the leased address and was unable to renew the lease. Network operations on this system may be affected.
Solution: There may be a connectivity problem preventing access to the DHCP server that leased the address, or the DHCP server may be offline. For best results within a cluster, use statically assigned IP addresses.

Event ID 2511

Source: Server
Description: The server service was unable to recreate the share "Sharename" because the directory "path" no longer exists.
Problem: The Server service attempted to create a share using the specified directory path. This problem may occur if you create a share (outside of Cluster Administrator) on a cluster shared device. If the device is not exclusively available to this computer, the server service cannot create the share. Also, the directory may no longer exist or there may be RPC related issues.
Solution: Correct the problem by creating a shared resource through Cluster Administrator, or correct the problem with the missing directory. Check dates of RPC files in the system32 directory. Make sure they concur with those contained in the service pack in use, or any hotfixes applied.

Event ID 4199

Source: TCPIP
Description: The system detected an address conflict for IP address "IP address" with the system having network hardware address "media access control address." Network operations on this system may be disrupted as a result.
Problem: Another system on the network may be using one of the addresses configured on this computer.
Solution: Resolve the IP address conflict. Check network adapter configuration and any IP address resources defined within the cluster.

Event ID 5719

Source: Netlogon
Description: No Windows NT Domain controller is available for domain "domain." (This event is expected and can be ignored when booting with the "No Net" hardware profile.) The following error occurred: There are currently no logon servers available to service the logon request.
Problem: A domain controller for the domain could not be contacted. As a result, proper authentication of accounts could not be completed. This may occur if the network is disconnected or disabled through system configuration.
Solution: Resolve the connectivity problem with the domain controller and restart the system.

Event ID 7000

Source: Service Control Manager


Description: The Cluster Service failed to start because of the following error: The service did not start because of a logon failure.
Problem: The service control manager attempted to start a service (possibly ClusSvc). It could not authenticate the service account. This error may be seen with Event 7013.
Solution: The service account could not be authenticated. This may be because of a failure
contacting a domain controller, or because account credentials are invalid. Check the
service account name and password and ensure that the account is available and that
credentials are correct. You may also try running the cluster service from a
command prompt (if currently logged on as an administrator) by changing to the
%systemroot%\Cluster directory (or where you installed the software) and typing
ClusSvc -debug. If the service starts and runs correctly, stop it by pressing CTRL+C
and troubleshoot the service account problem. This error may also occur if network
connectivity is disabled through the system configuration or hardware profile.
Microsoft Cluster Server requires network connectivity.
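As a sketch of the command sequence described above, assuming MSCS was installed to the default %systemroot%\Cluster directory on a system where %systemroot% is C:\WINNT:

C:\>CD /D %SystemRoot%\Cluster
C:\WINNT\Cluster>ClusSvc -debug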

Event ID 7013

Source: Service Control Manager


Description: Logon attempt with current password failed with the following error: There are currently no logon servers available to service the logon request.
More Info: The description for this error message may vary somewhat based on the actual error. For example, another error that may be listed in the event detail might be: "Logon Failure: unknown username or bad password."
Problem: The service control manager attempted to start a service (possibly ClusSvc). It could not authenticate the service account with a domain controller.
Solution: The service account may be in another domain, or this system is not a domain controller. It is acceptable for the node to be a non-domain controller, but the node needs access to a domain controller in its own domain as well as in the domain that the service account belongs to. Inability to contact the domain controller may be because of a problem with the server, network, or other factors. This problem is not related to the cluster software and must be resolved before you start the cluster software. This error may also occur if network connectivity is disabled through the system configuration or hardware profile. Microsoft Cluster Server requires network connectivity.

Event ID 7023

Source: Service Control Manager


Description: The Cluster Server service terminated with the following error: The quorum log could not be created or mounted successfully.
Problem: The Cluster Service attempted to start but could not gain access to the quorum log on the quorum disk. This may be because of problems gaining access to the disk or problems joining a cluster that has already formed.
Solution: Check the disk and quorum log for problems. If necessary, check the cluster logfile for more information. Other events in the system event log may give more information.

Appendix B: Using and Reading the Cluster Logfile

CLUSTERLOG Environment Variable


If you set the CLUSTERLOG environment variable, the cluster will create a logfile that contains
diagnostic information using the path specified. Important events during the operation of the
Cluster Service will be logged in this file. Because so many different events occur, the logfile
may be somewhat cryptic or hard to read. This document gives some hints about how to read the
logfile and information about what items to look for.
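As a minimal sketch, assuming the default installation directory, you might set CLUSTERLOG as a system environment variable (for example, through Control Panel, System, Environment) and then restart the Cluster Service so that the new value takes effect:

C:\>REM Example only: CLUSTERLOG set at the system level, for instance to
C:\>REM C:\WINNT\Cluster\cluster.log. Restart the Cluster Service to pick it up.
C:\>NET STOP ClusSvc
C:\>NET START ClusSvc

Keep in mind that stopping the Cluster Service on a node allows the other node to take over any groups it owns.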

Note: Each time you attempt to start the Cluster Service, the log will be cleared and a new
logfile started. Each component of MSCS that places an entry in the logfile will indicate itself by
abbreviation in square brackets. For example, the Node Manager component would be
abbreviated [NM]. Logfile entries will vary from one cluster to another. As a result, other
logfiles may vary from excerpts referenced in this document.

Note: Log entry lines in the following sections have been wrapped for space constraints in this
document. The lines do not normally wrap.

Operating System Version Number and Service Pack Level

Near the beginning of the logfile, notice the build number of MSCS, followed by the operating
system version number and service pack level. If you call for support, engineers may ask for this
information:

082::14-21:29:26.625 Cluster Service started - Cluster Version 1.224.


082::14-21:29:26.625 OS Version 4.0.1381 - Service Pack 3.

Cluster Service Startup

Following the version information, some initialization steps occur. Those steps are followed by
an attempt to join the cluster, if one node already exists in a running state. If the Cluster Service
could not detect any other cluster members, it will attempt to form the cluster. Consider the
following log entries:

0b5::12-20:15:23.531 We're initing Ep...


0b5::12-20:15:23.531 [DM]: Initialization
0b5::12-20:15:23.531 [DM] DmpRestartFlusher: Entry
0b5::12-20:15:23.531 [DM] DmpStartFlusher: Entry
0b5::12-20:15:23.531 [DM] DmpStartFlusher: thread created
0b5::12-20:15:23.531 [NMINIT] Initializing the Node Manager...
0b5::12-20:15:23.546 [NMINIT] Local node name = NODEA.
0b5::12-20:15:23.546 [NMINIT] Local node ID = 1.
0b5::12-20:15:23.546 [NM] Creating object for node 1 (NODEA)
0b5::12-20:15:23.546 [NM] node 1 state 1
0b5::12-20:15:23.546 [NM] Initializing networks.
0b5::12-20:15:23.546 [NM] Initializing network interface facilities.
0b5::12-20:15:23.546 [NMINIT] Initialization complete.
0b5::12-20:15:23.546 [FM] Starting worker thread...
0b5::12-20:15:23.546 [API] Initializing
0a9::12-20:15:23.546 [FM] Worker thread running
0b5::12-20:15:23.546 [lm] :LmInitialize Entry.
0b5::12-20:15:23.546 [lm] :TimerActInitialize Entry.
0b5::12-20:15:23.546 [CS] Initializing RPC server.
0b5::12-20:15:23.609 [INIT] Attempting to join cluster MDLCLUSTER
0b5::12-20:15:23.609 [JOIN] Spawning thread to connect to sponsor
192.88.80.114
06c::12-20:15:23.609 [JOIN] Asking 192.88.80.114 to sponsor us.
0b5::12-20:15:23.609 [JOIN] Waiting for all connect threads to terminate.
06c::12-20:15:32.750 [JOIN] Sponsor 192.88.80.114 is not available,
status=1722.
0b5::12-20:15:32.750 [JOIN] All connect threads have terminated.
0b5::12-20:15:32.750 [JOIN] Unable to connect to any sponsor node.
0b5::12-20:15:32.750 [INIT] Failed to join cluster, status 53
0b5::12-20:15:32.750 [INIT] Attempting to form cluster MDLCLUSTER
0b5::12-20:15:32.750 [Ep]: EpInitPhase1
0b5::12-20:15:32.750 [API] Online read only
04b::12-20:15:32.765 [RM] Main: Initializing.

Note that the cluster service attempts to join the cluster. If it cannot connect with an existing
member, the software decides to form the cluster. The next series of steps attempts to form
groups and resources necessary to accomplish this task. It is important to note that the cluster
service must arbitrate control of the quorum disk.

0b5::12-20:15:32.781 [FM] Creating group a1a13a86-0eaf-11d1


-8427-0000f8034599
0b5::12-20:15:32.781 [FM] Group a1a13a86-0eaf-11
d1-8427-0000f8034599 contains a1a13a87-0eaf-11d1-8427-0000f8034599.
0b5::12-20:15:32.781 [FM] Creating resource a1a13a87-0eaf-
11d1-8427-0000f8034599
0b5::12-20:15:32.781 [FM] FmpAddPossibleEntry adding
1 to a1a13a87-0eaf-11d1-8427-0000f8034599 possible node list
0b5::12-20:15:32.781 [FMX] Found the quorum
resource a1a13a87-0eaf-11d1-8427-0000f8034599.
0b5::12-20:15:32.781 [FM] All dependencies for a
1a13a87-0eaf-11d1-8427-0000f8034599 created
0b5::12-20:15:32.781 [FM] arbitrate for quorum
resource id a1a13a87-0eaf-11d1-8427-0000f8034599.
0b5::12-20:15:32.781 FmpRmCreateResource:
creating resource a1a13a87-0eaf-11d1-8427-0000f8034599
in shared resource monitor
0b5::12-20:15:32.812 FmpRmCreateResource:
created resource a1a13a87-0eaf-11d1-8427-0000f8034599, resid 1363016
0dc::12-20:15:32.828 Physical Disk <Disk D:>: Arbitrate returned status 0.
0b5::12-20:15:32.828 [FM] FmGetQuorumResource successful
0b5::12-20:15:32.828 FmpRmOnlineResource:
bringing resource a1a13a87-0eaf-11d1-8427-0000f8034599
(resid 1363016) online.
0b5::12-20:15:32.843 [CP] CppResourceNotify for resource Disk D:
0b5::12-20:15:32.843 [GUM] GumSendUpdate: Locker waiting
type 0 context 8
0b5::12-20:15:32.843 [GUM] Thread 0xb5 UpdateLock wait on Type 0
0b5::12-20:15:32.843 [GUM] DoLockingUpdate successful, lock granted to 1
0b5::12-20:15:32.843 [GUM] GumSendUpdate: Locker dispatching seq 388
type 0 context 8
0b5::12-20:15:32.843 [GUM] GumpDoUnlockingUpdate releasing lock ownership
0b5::12-20:15:32.843 [GUM] GumSendUpdate: completed update seq 388
type 0 context 8
0b5::12-20:15:32.843 [GUM] GumSendUpdate: Locker waiting
type 0 context 9
0b5::12-20:15:32.843 [GUM] Thread 0xb5 UpdateLock
wait on Type 0
0b5::12-20:15:32.843 [GUM] DoLockingUpdate successful,
lock granted to 1
0b5::12-20:15:32.843 [GUM] GumSendUpdate:
Locker dispatching seq 389
type 0 context 9
0b5::12-20:15:32.843 [GUM] GumpDoUnlockingUpdate
releasing lock ownership
0b5::12-20:15:32.843 [GUM] GumSendUpdate:
completed update seq 389
type 0 context 9
0b5::12-20:15:32.843 FmpRmOnlineResource:
Resource a1a13a87-0eaf-11d1-8427-0000f8034599 pending
0e1::12-20:15:33.359 Physical Disk <Disk D:>: Online,
created registry watcher thread.
090::12-20:15:33.359 [FM] NotifyCallBackRoutine: enqueuing event
04d::12-20:15:33.359 [FM] WorkerThread,
processing transition event for a1a13a87-0eaf-11
d1-8427-0000f8034599, oldState = 129, newState = 2.
04d::12-20:15:33.359 [FM] HandleResourceTransition:
Resource Name = a1a13a87-0eaf-11d1-8427-0000f8034599
old state=129 new state=2
04d::12-20:15:33.359 [DM] DmpQuoObjNotifyCb:
Quorum resource is online
04d::12-20:15:33.375 [DM] DmpQuoObjNotifyCb:
Own quorum resource, try open the quorum log
04d::12-20:15:33.375 [DM] DmpQuoObjNotifyCb:
the name of the quorum file is D:\MSCS\quolog.log
04d::12-20:15:33.375 [lm] LogCreate :
Entry FileName=D:\MSCS\quolog.log MaxFileSize=
0x00010000
04d::12-20:15:33.375 [lm] LogpCreate : Entry

In this case, the node forms the cluster group and quorum disk resource, gains control of the disk,
and opens the quorum logfile. From here, the cluster performs operations with the logfile, and
proceeds to form the cluster. This involves configuring network interfaces and bringing them
online.

0b5::12-20:15:33.718 [NM] Beginning form process.


0b5::12-20:15:33.718 [NM] Synchronizing node information.
0b5::12-20:15:33.718 [NM] Creating node objects.
0b5::12-20:15:33.718 [NM] Configuring networks & interfaces.
0b5::12-20:15:33.718 [NM] Synchronizing network information.
0b5::12-20:15:33.718 [NM] Synchronizing interface information.
0b5::12-20:15:33.718 [dm] DmBeginLocalUpdate Entry
0b5::12-20:15:33.718 [dm] DmBeginLocalUpdate Exit,
pLocalXsaction=0x00151c20 dwError=0x00000000
0b5::12-20:15:33.718 [NM] Setting database
entry for interface a1a13a7f-0eaf-11d1-8427-0000f8034599
0b5::12-20:15:33.718 [dm] DmCommitLocalUpdate Entry
0b5::12-20:15:33.718 [dm] DmCommitLocalUpdate Exit,
dwError=0x00000000
0b5::12-20:15:33.718 [dm] DmBeginLocalUpdate Entry
0b5::12-20:15:33.875 [dm] DmBeginLocalUpdate Exit,
pLocalXsaction=0x00151c20 dwError=0x00000000
0b5::12-20:15:33.875 [NM] Setting database entry
for interface a1a13a81-0eaf-11d1-8427-0000f8034599
0b5::12-20:15:33.875 [dm] DmCommitLocalUpdate Entry
0b5::12-20:15:33.875 [dm] DmCommitLocalUpdate Exit,
dwError=0x00000000
0b5::12-20:15:33.875 [NM] Matched 2 networks,
created 0 new networks.
0b5::12-20:15:33.875 [NM] Resynchronizing network information.
0b5::12-20:15:33.875 [NM] Resynchronizing interface information.
0b5::12-20:15:33.875 [NM] Creating network objects.
0b5::12-20:15:33.875 [NM]
Creating object for network a1a13a7e-0eaf-11d1-
8427-0000f8034599
0b5::12-20:15:33.875 [NM]
Creating object for network a1a13a80-0eaf-11d1-
8427-0000f8034599
0b5::12-20:15:33.875 [NM] Creating interface objects.
0b5::12-20:15:33.875 [NM]
Creating object for interface a1a13a7f-0eaf-11d1-8427-
0000f8034599.
0b5::12-20:15:33.875 [NM]
Registering network a1a13a7e-0eaf-11d1-8427-
0000f8034599 with
cluster transport.
0b5::12-20:15:33.875 [NM]
Registering interfaces for network a1a13a7e-0eaf-11d1-8427-
0000f8034599 with cluster transport.
0b5::12-20:15:33.875 [NM]
Registering interface a1a13a7f-0eaf-
11d1-8427-0000f8034599 with cluster transport,
addr 9.9.9.2, endpoint 3003.
0b5::12-20:15:33.890 [NM]
Instructing cluster transport to bring network a1a13a7e-0eaf-11d1-
8427-0000f8034599 online.
0b5::12-20:15:33.890 [NM]
Creating object for interface a1a13a81-0eaf-11d1-
8427-0000f8034599.
0b5::12-20:15:33.890 [NM]
Registering network a1a13a80-0eaf-11d1-8427-
0000f8034599
with cluster transport.
0b5::12-20:15:33.890 [NM]
Registering interfaces for network a1a13a80-0eaf-11d1-8427-
0000f8034599
with cluster transport.
0b5::12-20:15:33.890 [NM]
Registering interface a1a13a81-0eaf-11d1-8427-
0000f8034599
with cluster transport, addr 192.88.80.190, endpoint 3003.
0b5::12-20:15:33.890 [NM]
Instructing cluster transport to bring network a1a13a80-0eaf-11d1-
8427-0000f8034599 online.
After initializing network interfaces, the cluster will continue formation with the enumeration of
cluster nodes. In this case, as a newly formed cluster, the cluster will contain only one node. If
this session had been joining an existing cluster, the node enumeration would show two nodes.
Next, the cluster will bring the Cluster IP address and Cluster Name resources online.

0b5::12-20:15:34.015 [FM] OnlineGroup:


setting group state to Online for f901aa29-0eaf-11d1-
8427-0000f8034599
069::12-20:15:34.015 IP address <
Cluster IP address>: Created NBT interface \Device\NetBt_
If6 (instance 355833456).
0b5::12-20:15:34.015 [FM]
FmpAddPossibleEntry adding 1 to a1a13a87-0eaf-11d1-8427
-0000f8034599 possible node list
0b5::12-20:15:34.015 [FM]
FmFormNewClusterPhase2 complete.
.
.
.
0b5::12-20:15:34.281 [INIT] Successfully formed a cluster.
09c::12-20:15:34.281 [lm] :ReSyncTimerHandles Entry.
09c::12-20:15:34.281 [lm] :ReSyncTimerHandles Exit gdwNumHandles=3
0b5::12-20:15:34.281 [INIT] Cluster Started! Original Min WS is
204800, Max WS is 1413120.
08c::12-20:15:34.296 [CPROXY] clussvc initialized
069::12-20:15:40.421 IP address <Cluster IP Address>:
IP Address 192.88.80.114 on adapter DC21X41 online
.
.
.
04d::12-20:15:40.421 [FM] OnlineWaitingTree,
a1a13a84-0eaf-11d1-8427-0000f8034599
depends on a1a13a83-0eaf-11d1-8427-0000f8034599. Start first
04d::12-20:15:40.421 [FM] OnlineWaitingTree,
Start resource a1a13a84-0eaf-11d1-8427-0000f8034599
04d::12-20:15:40.421 [FM] OnlineResource:
a1a13a84-0eaf-11d1-8427-0000f8034599
depends on a1a13a83-0eaf-11d1-8427-0000f8034599. Bring online first.
04d::12-20:15:40.421 FmpRmOnlineResource:
bringing resource a1a13a84-0eaf-11d1-8427-0000f8034599
(resid 1391032) online.
04d::12-20:15:40.421 [CP] CppResourceNotify for resource Cluster Name
04d::12-20:15:40.421 [GUM] GumSendUpdate: Locker waiting
type 0 context 8
04d::12-20:15:40.437 [GUM] Thread 0x4d UpdateLock wait on Type 0
04d::12-20:15:40.437 [GUM] DoLockingUpdate successful, lock granted to 1
076::12-20:15:40.437 Network Name <Cluster Name>:
Bringing resource online...
04d::12-20:15:40.437 [GUM] GumSendUpdate: Locker dispatching seq 411
type 0 context 8
04d::12-20:15:40.437 [GUM] GumpDoUnlockingUpdate
releasing lock ownership
04d::12-20:15:40.437 [GUM] GumSendUpdate: completed update seq 411
type 0 context 8
04d::12-20:15:40.437 [GUM] GumSendUpdate: Locker waiting
type 0 context 11
.
.
.
076::12-20:15:43.515 Network Name <Cluster Name>:
Registered server name MDLCLUSTER on transport \Device\NetBt_If6.
076::12-20:15:46.578 Network Name <Cluster Name>:
Registered workstation name MDLCLUSTER on transport \Device\NetBt_If6.
076::12-20:15:46.578 Network Name <Cluster Name>:
Network Name MDLCLUSTER is now online

Following these steps, the cluster will attempt to bring other resources and groups online. The
logfile will continue to increase in size as the cluster service runs. Therefore, it may be a good
idea to enable this option when you are having problems, rather than leaving it on for days or
weeks at a time.

Logfile Entries for Common Failures

After reviewing a successful startup of the Cluster Service, you may want to examine some
errors that may appear because of various failures. The following examples illustrate possible log
entries for four different failures.

Example 1: Quorum Disk Turned Off

If the cluster attempts to form and cannot connect to the quorum disk, entries similar to the
following may appear in the logfile. Because of the failure, the cluster cannot form, and the
Cluster Service terminates.

0b9::14-20:59:42.921 [RM] Main: Initializing.


08f::14-20:59:42.937 [FM]
Creating group a1a13a86-0eaf-11d1-8427-
0000f8034599
08f::14-20:59:42.937 [FM]
Group a1a13a86-0eaf-11d1-8427-0000f8034599 contains
a1a13a87-0eaf-11d1-8427-0000f8034599.
08f::14-20:59:42.937 [FM]
Creating resource a1a13a87-0eaf-11d1-8427-
0000f8034599
08f::14-20:59:42.937 [FM]
FmpAddPossibleEntry adding 1 to a1a13a87-0eaf-11d1-8427-
0000f8034599 possible node list
08f::14-20:59:42.937 [FMX]
Found the quorum resource a1a13a87-0eaf-11d1-8427-
0000f8034599.
08f::14-20:59:42.937 [FM]
All dependencies for a1a13a87-0eaf-11d1-8427-
0000f8034599 created
08f::14-20:59:42.937 [FM]
arbitrate for quorum resource id a1a13a87-0eaf-11d1-8427-
0000f8034599.
08f::14-20:59:42.937
FmpRmCreateResource:
creating resource a1a13a87-0eaf-11d1-8427-
0000f8034599 in
shared resource monitor
08f::14-20:59:42.968 FmpRmCreateResource:
created resource a1a13a87-0eaf-11d1-8427-
0000f8034599,
resid 1362616
0e9::14-20:59:43.765 Physical Disk <Disk D:>:
SCSI, error reserving disk, error 21.
0e9::14-20:59:54.125 Physical Disk <Disk D:>:
SCSI, error reserving disk, error 21.
0e9::14-20:59:54.140 Physical Disk <Disk D:>:
Arbitrate returned status 21.
08f::14-20:59:54.140 [FM] FmGetQuorumResource
failed, error 21.
08f::14-20:59:54.140 [INIT] Cleaning up failed form attempt.
08f::14-20:59:54.140 [INIT] Failed to form cluster,
status 3213068.
08f::14-20:59:54.140 [CS] ClusterInitialize failed 21
08f::14-20:59:54.140 [INIT] The cluster service is shutting down.
08f::14-20:59:54.140 [evt] EvShutdown
08f::14-20:59:54.140 [FM] Shutdown:
Failover Manager requested to shutdown groups.
08f::14-20:59:54.140 [FM] DestroyGroup:
destroying a1a13a86-0eaf-11d1-8427-0000f8034599
08f::14-20:59:54.140 [FM] DestroyResource:
destroying a1a13a87-0eaf-11d1-8427-0000f8034599
08f::14-20:59:54.140 [OM] Deleting object Physical Disk
08f::14-20:59:54.140 [FM]
Resource a1a13a87-0eaf-11d1-8427-0000f8034599 destroyed.
08f::14-20:59:54.140 [FM]
Group a1a13a86-0eaf-11d1-8427-0000f8034599 destroyed.
08f::14-20:59:54.140 [Dm] DmShutdown
08f::14-20:59:54.140 [DM] DmpShutdownFlusher: Entry
08f::14-20:59:54.156 [DM] DmpShutdownFlusher: Setting event
062::14-20:59:54.156 [DM] DmpRegistryFlusher: got 0
062::14-20:59:54.156 [DM] DmpRegistryFlusher: exiting
0ca::14-20:59:54.156 [FM] WorkItem, delete resource
<Disk D:> status 0
0ca::14-20:59:54.156 [OM]
Deleting object Disk Group 1 (a1a13a86-0eaf-11d1-
8427-0000f8034599)
0e7::14-20:59:54.375 [CPROXY] clussvc terminated, error 0.
0e7::14-20:59:54.375 [CPROXY] Service Stopping...
0b9::14-20:59:54.375 [RM] Going away, Status = 1, Shutdown = 0.
02c::14-20:59:54.375 [RM]
PollerThread stopping. Shutdown = 1, Status = 0, WaitFailed = 0,
NotifyEvent address = 196.
0e7::14-20:59:54.375 [CPROXY] Cleaning up
0b9::14-20:59:54.375 [RM]
RundownResources posting shutdown notification.
0e7::14-20:59:54.375 [CPROXY] Cleanup complete.
0e3::14-20:59:54.375 [RM] NotifyChanges shutting down.
0e7::14-20:59:54.375 [CPROXY] Service Stopped.

Perhaps the most meaningful lines from above are:


0e9::14-20:59:43.765 Physical Disk <Disk D:>: SCSI,
error reserving disk, error 21.
0e9::14-20:59:54.125 Physical Disk <Disk D:>: SCSI,
error reserving disk, error 21.
0e9::14-20:59:54.140 Physical Disk <Disk D:>:
Arbitrate returned status 21.

Note: The error code on these logfile entries is 21. You can issue net helpmsg 21 from the
command line and receive the explanation of the error status code. Status code 21 means, "The
device is not ready." This indicates a possible problem with the device. In this case, the device
was turned off, and the error status correctly indicates the problem.
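For example, from a command prompt:

C:\>NET HELPMSG 21

The device is not ready.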

Example 2: Quorum Disk Failure

In this example, the drive has failed or has been reformatted from the SCSI controller. As a
result, the cluster service cannot locate a drive with the specific signature it is looking for.

0b8::14-21:11:46.515 [RM] Main: Initializing.


074::14-21:11:46.531 [FM]
Creating group a1a13a86-0eaf-11d1-8427-0000f8034599
074::14-21:11:46.531 [FM]
Group a1a13a86-0eaf-11d1-8427-0000f8034599 contains
a1a13a87-0eaf-11d1-8427-0000f8034599.
074::14-21:11:46.531 [FM]
Creating resource a1a13a87-0eaf-11d1-8427-0000f8034599
074::14-21:11:46.531 [FM]
FmpAddPossibleEntry adding 1 to a1a13a87-0eaf-11d1-8427-
0000f8034599 possible node list
074::14-21:11:46.531 [FMX]
Found the quorum resource a1a13a87-0eaf-11d1-8427-
0000f8034599.
074::14-21:11:46.531 [FM]
All dependencies for a1a13a87-0eaf-11d1-8427-0000f8034599 created
074::14-21:11:46.531 [FM]
arbitrate for quorum resource id a1a13a87-0eaf-11d1-8427-
0000f8034599.
074::14-21:11:46.531 FmpRmCreateResource:
creating resource a1a13a87-0eaf-11d1-8427-0000f8034599 in
shared resource monitor
074::14-21:11:46.562 FmpRmCreateResource:
created resource a1a13a87-0eaf-11d1-8427-0000f8034599,
resid 1362696
075::14-21:11:46.671 Physical Disk <Disk D:>:
SCSI,Performing bus rescan.
075::14-21:11:51.843 Physical Disk <Disk D:>:
SCSI,error attaching to signature 71cd0549, error 2.
075::14-21:11:51.843 Physical Disk <Disk D:>:
Unable to attach to signature 71cd0549. Error: 2.
074::14-21:11:51.859 [FM] FmGetQuorumResource failed, error 2.
074::14-21:11:51.859 [INIT] Cleaning up failed form attempt.

In this case, the most important logfile entries are:


075::14-21:11:51.843 Physical Disk <Disk D:>:
SCSI, error attaching to signature 71cd0549, error 2.
075::14-21:11:51.843 Physical Disk <Disk D:>:
Unable to attach to signature 71cd0549. Error: 2.

Status code 2 means, "The system cannot find the file specified." The error in this case may
mean that it cannot find the disk, or that, because of some kind of problem, it cannot locate the
quorum logfile that should be on the disk.

Example 3: Duplicate Cluster IP Address

If another computer on the network has the same IP address as the cluster IP address resource,
the resource will be prevented from going online. Further, the cluster name will not be registered
on the network, because it depends on the IP address resource. Because this name is the network name
used for cluster administration, you will not be able to administer the cluster by using this name
during this type of failure. However, you may be able to use the computer name of the cluster node
to connect with Cluster Administrator. Additionally, you may be able to connect locally from the
console by using the loopback address. The following sample entries are from a cluster logfile
during this type of failure:

0b9::14-21:32:59.968 IP Address <Cluster IP Address>:


The IP address is already in use on the network, status 5057.
0d2::14-21:32:59.984 [FM] NotifyCallBackRoutine: enqueuing event
03e::14-21:32:59.984 [FM]
WorkerThread, processing transition event for
a1a13a83-0eaf-11d1-8427-0000f8034599, oldState = 129, newState = 4.03e
.
.
.
03e::14-21:32:59.984
FmpHandleResourceFailure:
taking resource a1a13a83-0eaf-11d1-8427-0000f8034599 and dependents offline
03e::14-21:32:59.984 [FM]
TerminateResource: a1a13a84-0eaf-11d1-8427-0000f8034599
depends on a1a13a83-0eaf-11d1-8427-0000f8034599. Terminating first
0d3::14-21:32:59.984 Network Name <Cluster Name>:
Terminating name MDLCLUSTER...
0d3::14-21:32:59.984 Network Name <Cluster Name>:
Name MDLCLUSTER is already offline.
.
.
.
03e::14-21:33:00.000 FmpRmTerminateResource:
a1a13a84-0eaf-11d1-8427-0000f8034599 is now offline
0c7::14-21:33:00.000 IP Address <Cluster IP Address>:
Terminating resource...
0c7::14-21:33:00.000 IP Address <Cluster IP Address>:
Address 192.88.80.114 on adapter DC21X41 offline.

Example 4: Evicted Node Attempts to Join Existing Cluster


If you evict a node from a cluster, the cluster software on that node must be reinstalled to gain
access to the cluster again. If you start the evicted node, and the Cluster Service attempts to join
the cluster, entries similar to the following may appear in the cluster logfile:

032::26-16:11:45.109 [INIT]
Attempting to join cluster MDLCLUSTER
032::26-16:11:45.109 [JOIN]
Spawning thread to connect to sponsor 192.88.80.115
040::26-16:11:45.109 [JOIN]
Asking 192.88.80.115 to sponsor us.
032::26-16:11:45.109 [JOIN]
Spawning thread to connect to sponsor 9.9.9.2
032::26-16:11:45.109 [JOIN]
Spawning thread to connect to sponsor 192.88.80.190
099::26-16:11:45.109 [JOIN]
Asking 9.9.9.2 to sponsor us.
032::26-16:11:45.109 [JOIN]
Spawning thread to connect to sponsor NODEA
098::26-16:11:45.109 [JOIN]
Asking 192.88.80.190 to sponsor us.
032::26-16:11:45.125 [JOIN]
Waiting for all connect threads to terminate.
092::26-16:11:45.125 [JOIN]
Asking NODEA to sponsor us.
040::26-16:12:18.640 [JOIN]
Sponsor 192.88.80.115 is not available (JoinVersion), status=1722.
098::26-16:12:18.640 [JOIN]
Sponsor 192.88.80.190 is not available (JoinVersion), status=1722.
099::26-16:12:18.640 [JOIN]
Sponsor 9.9.9.2 is not available (JoinVersion), status=1722.
098::26-16:12:18.640 [JOIN]
JoinVersion data for sponsor 157.57.224.190 is invalid, status 1722.
099::26-16:12:18.640 [JOIN]
JoinVersion data for sponsor 9.9.9.2 is invalid, status 1722.
040::26-16:12:18.640 [JOIN]
JoinVersion data for sponsor 157.58.80.115 is invalid, status 1722.
092::26-16:12:18.703 [JOIN]
Sponsor NODEA is not available (JoinVersion), status=1722.
092::26-16:12:18.703 [JOIN]
JoinVersion data for sponsor NODEA is invalid, status 1722.
032::26-16:12:18.703 [JOIN]
All connect threads have terminated.
032::26-16:12:18.703 [JOIN]
Unable to connect to any sponsor node.
032::26-16:12:18.703 [INIT]
Failed to join cluster, status 0
032::26-16:12:18.703 [INIT]
Attempting to form cluster MDLCLUSTER
.
.
.
032::26-16:12:18.734 [FM]
arbitrate for quorum resource id 24acc093-1e28-11d1-9e5d-0000f8034599.
032::26-16:12:18.734 [FM]
FmpQueryResourceInfo:initialize the resource with the registry information
032::26-16:12:18.734 FmpRmCreateResource:
creating resource 24acc093-1e28-11d1-9e5d-0000f8034599 in shared
resource monitor
032::26-16:12:18.765 FmpRmCreateResource:
created resource 24acc093-1e28-11d1-9e5d-0000f8034599, resid 1360000
06d::26-16:12:18.812
Physical Disk <Disk G:>: SCSI, error attaching to signature b2320a9b, error 2.
06d::26-16:12:18.812
Physical Disk <Disk G:>: Unable to attach to signature b2320a9b. Error: 2.
032::26-16:12:18.812 [FM]
FmGetQuorumResource failed, error 2.
032::26-16:12:18.812 [INIT] Cleaning up failed form attempt.
032::26-16:12:18.812 [INIT] Failed to form cluster, status 2.
032::26-16:12:18.828 [CS] ClusterInitialize failed 2

The node attempts to join the existing cluster, but has invalid credentials, because it was
previously evicted. Therefore, the existing node refuses to communicate with it. The node may
attempt to form its own version of the cluster, but cannot gain control of the quorum disk,
because the existing cluster node maintains ownership. Examination of the logfile on the existing
cluster node reveals that the Cluster Service posted entries to reflect the failed attempt to join:

0c4::29-18:13:31.035 [NMJOIN] Processing request by node 2 to


begin joining.
0c4::29-18:13:31.035 [NMJOIN] Node 2 is not a member of this
cluster. Cannot join.

Appendix C: Command-Line Administration

You can perform many of the administrative tasks for MSCS from the Windows NT command
prompt, without using the graphical interface. While the graphical method provides easier
administration and status of cluster resources at a glance, MSCS provides the capability to issue
most administrative commands from the command line. This ability opens up interesting possibilities
for batch files, scheduled commands, and other techniques in which many tasks may be automated.

Using Cluster.exe

Cluster.exe is a companion program and is installed with Cluster Administrator. While the
Microsoft Cluster Server Administrator's Guide details basic syntax for this utility, the intention
of this section is to complement the existing documentation and to offer examples. All examples
in this section assume a cluster name of MYCLUSTER, installed in the domain called
MYDOMAIN, with NODEA and NODEB as servers in the cluster. All examples are given as a
single command line.

Note: Specify any names that contain spaces within quotation marks.

Basic Syntax
With the exception of the cluster /? command, which returns basic syntax for the command,
every command line uses the syntax:

CLUSTER [cluster name] /option

To test connectivity with a cluster, or to ensure you can use Cluster.exe, try the simple command
in the next section to check the version number (/version).

Cluster Commands

Version Number

To check the version number of your cluster, use a command similar to the following:

CLUSTER mycluster /version

If your cluster were named MYCLUSTER, the above command would return the version
information for the product.

Listing Clusters in the Domain

To list all clusters within a single domain, use a command including the /list option like this:

CLUSTER mycluster /LIST:mydomain

Node Commands

All commands directed toward a specific cluster node must use the following syntax:

CLUSTER [cluster name] NODE [node name] /option

Node Status

To obtain the status of a particular cluster node, use the /status command. For example:

CLUSTER mycluster NODE NodeA /Status

The node name is optional only for the /status command, so the following command will report
the status of all nodes in the cluster:

CLUSTER mycluster NODE /Status

Pause or Resume

The pause option allows the cluster service to continue running and communicating in the
cluster. However, the paused node may not own groups or resources. For example, to pause a
node, use the /pause switch:
CLUSTER mycluster NODE NodeB /Pause

An example of the use of this command might be to transfer groups to another node while you
perform some other task, such as running a backup or disk defragmentation utility. To resume the
node, simply use the /resume switch instead:

CLUSTER mycluster NODE NodeB /Resume

Evict a Node

The evict option removes the ability of a node to participate in the cluster. In other words, the
cluster node loses membership rights in the cluster. The only way to grant membership rights
again to the evicted node is:

1. Remove the cluster software from the evicted node through Add/Remove Programs in Control Panel.
2. Restart the node.
3. Reinstall MSCS on the previously evicted node through the MSCS Setup program.

To perform this action, use a command similar to the following:

CLUSTER mycluster NODE NodeB /Evict

Changing Node Properties

A cluster node has only one property that may be changed by Cluster.exe: the node description.
This example illustrates how to change it from the command line. For example:

CLUSTER mycluster NODE NodeA /Properties Description="The best node in MyCluster."

A good use for this property might be in environments with multiple administrators. For example,
if you pause a node to run a large application on it, you can change the node description to
reflect this. The field can serve as a reminder to yourself and to other administrators as to why
the node was paused, and that someone may want to /resume the node later. You might also include
the /pause and /resume commands in a batch file that prepares a node for the designated task, as
shown in the sketch that follows.
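The following is a minimal sketch of such a batch file. It uses only commands shown in this
appendix; the cluster, node, group, and description values are assumptions you would replace to
match your environment:

REM Pause NodeB and record the reason in its description.
CLUSTER mycluster NODE NodeB /Pause
CLUSTER mycluster NODE NodeB /Properties Description="Paused for nightly backup"

REM Move the group to the other node and wait up to 120 seconds.
CLUSTER mycluster GROUP mygroup /MoveTo:NodeA /Wait:120

REM ... run the backup or other maintenance task here ...

REM Resume the node and reset the description.
CLUSTER mycluster NODE NodeB /Resume
CLUSTER mycluster NODE NodeB /Properties Description="Available"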

Group Commands

All group commands use the syntax:

CLUSTER [cluster name] GROUP [group name] /option


Group Status

To obtain the status of a group, you may use the /status option. This option is the only group
option in which the group name is optional. Therefore, if you omit the group name, the status of
all groups will be displayed. Another status option (/node) will display group status by node.

Example 1: Status of all groups:

CLUSTER mycluster GROUP /Status

Example 2: Status of all groups owned by a specific node:

CLUSTER mycluster GROUP /Status /Node:nodea

Example 3: Status of a specific group:

CLUSTER mycluster GROUP "Cluster Group"

Create a New Group

It is easy to create a new group from the command line.

Note: The following example creates a group called mygroup:

CLUSTER mycluster GROUP mygroup /create

Delete a Group

Just as simple as the /create option, you may delete groups from the command line.
However, the group must be empty before it can be deleted.

CLUSTER mycluster GROUP mygroup /delete

Rename a Group

To rename a group, use the following syntax:

CLUSTER mycluster GROUP mygroup /rename:yourgroup

Move, Online, and Offline Group Commands

The move group command may be used to transfer ownership of a group and its resources to
another node. By design, the move command must take the group offline and bring it online on
the other node. Further, a timeout value (number of seconds) may be supplied to specify the time
to wait before cancellation of the move request. By default, Cluster.exe waits indefinitely until
the state of the group changes to the desired state.
Examples:

CLUSTER mycluster GROUP mygroup /MoveTo:Nodeb /wait:120


CLUSTER mycluster GROUP mygroup /Offline
CLUSTER mycluster GROUP mygroup /Online

Group Properties

Use the /property option to display or set group properties. Documentation on common
properties for groups may be found in the Microsoft Cluster Server Administrator's Guide. One
additional property not documented is LoadBalState. This property is not used in MSCS version
1.0, and is reserved for future use.

Examples:

CLUSTER mycluster GROUP mygroup /Properties


CLUSTER mycluster GROUP
mygroup /Properties Description="My favorite group"

Preferred Owner

You may specify a preferred owner for a group. The preferred owner is the node on which you prefer
each group to run. If a node fails, the remaining node takes over the groups from the failed node. By
setting the fail back option at the group level, groups may fail back to their preferred server when
the node becomes available. A group does not fail back if a preferred owner is not specified.
MSCS version 1.0 is limited to two nodes in a cluster. For best results, specify no more than one
preferred owner. In future releases, this property may use a list of more than one preferred
owner.

Example: To list the preferred owner for a group, type:

CLUSTER mycluster GROUP mygroup /Listowner

Example: To specify the preferred owner list, type:

CLUSTER mycluster GROUP mygroup /Setowners:Nodea

Resource Commands

Resource Status

To list the status of resources or a particular resource, you can use the /status option. Note the
following examples:

CLUSTER mycluster RESOURCE /Status


CLUSTER mycluster RESOURCE myshare /Status

Create a New Resource


To create a new resource, use the /create option.

Note: To avoid errors, you must specify all required parameters for the resource. The /create
option allows creation of resources in an incomplete state. Make sure to set additional resource
properties as appropriate with subsequent commands.

Example: Command sequence to add a file share resource

CLUSTER mycluster RESOURCE


myshare /Create /Group:mygroup /Type:"File Share"
CLUSTER mycluster RESOURCE
myshare /PrivProp ShareName="myshare"
CLUSTER mycluster RESOURCE
myshare /PrivProp Path="w:\myshare"
CLUSTER mycluster RESOURCE
myshare /PrivProp Maxusers=-1
CLUSTER mycluster RESOURCE
myshare /AddDependency:"Disk W"

Note: Command lines in the sections above have been wrapped for space constraints in this
document. The lines do not normally wrap.

Simulating Resource Failure

You can simulate resource failure in a cluster from the command line by using the /fail option
for a resource. This option is similar to using the Initiate Failure command from Cluster
Administrator. The command assumes that the resource is already online.

Example:

CLUSTER mycluster RESOURCE myshare /Fail

Online/Offline Resource Commands

The /online and /offline resource commands work very much the same way as the corresponding
group commands, and also may use the /wait option to specify a time limit (in seconds) for the
operation to complete.

Examples:

CLUSTER mycluster RESOURCE myshare /Offline


CLUSTER mycluster RESOURCE myshare /Online

Dependencies

Resource dependency relationships may be listed or changed from the command line. To add or
remove a dependency, you must know the name of the resource to be added or removed as a
dependency.
Examples:

CLUSTER mycluster RESOURCE myshare /ListDependencies


CLUSTER mycluster RESOURCE myshare /AddDependency:"Disk W:"
CLUSTER mycluster RESOURCE myshare /RemoveDependency:"Disk W:"

Note: Command lines in the sections above have been wrapped for space constraints in this
document. The lines do not normally wrap.

Example Batch Job

The following example takes an existing group, mygroup, and creates resources within the
group. The example creates an IP address resource and a network name resource, and then initiates
a failure to test failover. During the process, it uses various reporting commands to obtain the
status of the group and resources. The example shows the output from all commands given. The
commands in this example work, but may require minor alteration, depending on the cluster, group,
resource, network, and IP addresses configured in your environment, if you choose to use them.

Note: The LoadBal properties reported in the example are reserved for future use. The
EnableNetBIOS property for the IP address resource is a Service Pack 4 addition and must be
set to 1 for the resource to be a valid dependency for a network name resource.

C:\>REM Get group status


C:\>CLUSTER mycluster GROUP mygroup /status
Listing status for resource group 'mygroup':
Group Node Status
-------------------- --------------- ------
mygroup NodeA Online
C:\>REM Create the IP Address resource: myip
C:\>CLUSTER mycluster RESOURCE myip /create /Group:mygroup /Type:"Ip Address"
Creating resource 'myip'...
Resource Group Node Status
-------------------- -------------------- --------------- ------
myip mygroup NodeA Offline
C:\>REM Define the IP Address parameters
C:\>CLUSTER mycluster RESOURCE myip /priv network:client
C:\>CLUSTER mycluster RESOURCE myip /priv address:157.57.152.23
C:\>REM Redundant. Subnet mask should already be same as network uses.
C:\>CLUSTER mycluster RESOURCE myip /priv subnetmask:255.255.252.0
C:\>CLUSTER mycluster RESOURCE myip /priv EnableNetBIOS:1
C:\>REM Check the status
C:\>CLUSTER mycluster RESOURCE myip /Stat
Listing status for resource 'myip':
Resource Group Node Status
-------------------- -------------------- --------------- ------
myip mygroup NodeA Offline
C:\>REM View the properties
C:\>CLUSTER mycluster RESOURCE myip /prop
Listing properties for 'myip':
R Name Value
--------------------------------- -------------------------------
R Name myip
Type IP Address
Description
DebugPrefix
SeparateMonitor 0 (0x0)
PersistentState 0 (0x0)
LooksAlivePollInterval 5000 (0x1388)
IsAlivePollInterval 60000 (0xea60)
RestartAction 2 (0x2)
RestartThreshold 3 (0x3)
RestartPeriod 900000 (0xdbba0)
PendingTimeout 180000 (0x2bf20)
LoadBalStartupInterval 300000 (0x493e0)
LoadBalSampleInterval 10000 (0x2710)
LoadBalAnalysisInterval 300000 (0x493e0)
LoadBalMinProcessorUnits 0 (0x0)
LoadBalMinMemoryUnits 0 (0x0)
C:\>REM View the private properties
C:\>CLUSTER mycluster RESOURCE myip /priv
Listing private properties for 'myip':
R Name Value
--------------------------------- -------------------------------
Network Client
Address 157.57.152.23
SubnetMask 255.255.252.0
EnableNetBIOS 1 (0x1)
C:\>REM Bring online and wait 60 sec. for completion
C:\>CLUSTER mycluster RESOURCE myip /Online /Wait:60
Bringing resource 'myip' online...
Resource Group Node Status
-------------------- -------------------- --------------- ------
myip mygroup NodeA Online
C:\>REM Check the status again.
C:\>CLUSTER mycluster RESOURCE myip /Stat
Listing status for resource 'myip':
Resource Group Node Status
-------------------- -------------------- --------------- ------
myip mygroup NodeA Online
C:\>REM Define a network name resource
C:\>CLUSTER mycluster RESOURCE mynetname /Create /
Group:mygroup /Type:"Network Name"
Creating resource 'mynetname'...
Resource Group Node Status
-------------------- -------------------- --------------- ------
mynetname mygroup NodeA Offline
C:\>CLUSTER mycluster RESOURCE mynetname /priv Name:"mynetname"
C:\>CLUSTER mycluster RESOURCE mynetname /Adddependency:myip
Making resource 'mynetname' depend on resource 'myip'...
C:\>REM Status check
C:\>CLUSTER mycluster RESOURCE mynetname /Stat
Listing status for resource 'mynetname':
Resource Group Node Status
-------------------- -------------------- --------------- ------
mynetname mygroup NodeA Offline
C:\>REM Bring the network name online
C:\>CLUSTER mycluster RESOURCE mynetname /Online /Wait:60
Bringing resource 'mynetname' online...
Resource Group Node Status
-------------------- -------------------- --------------- ------
mynetname mygroup NodeA Online
C:\>REM Status check
C:\>CLUSTER mycluster Group mygroup /stat
Listing status for resource group 'mygroup':
Group Node Status
-------------------- --------------- ------
mygroup NodeA Online
C:\>REM Let's simulate a failure of the IP address
C:\>CLUSTER mycluster RESOURCE myip /Fail
Failing resource 'myip'...
Resource Group Node Status
-------------------- -------------------- --------------- ------
myip mygroup NodeA Online Pending
C:\>REM Get group status
C:\>CLUSTER mycluster GROUP mygroup /status
Listing status for resource group 'mygroup':
Group Node Status
-------------------- --------------- ------
mygroup NodeA Online
