HA Solutions For Windows, SQL, and Exchange Servers
by Sameer Dandage
Daragh Morrissey
Jeremy Moskowitz
Paul Robichaux
Mel Shum
Ben Smith
Bill Stewart
Contents
Chapter 1: Surviving the Worst
Ben Smith
A 6-Step Plan
Step 1: Identify Critical Business Activities
Step 2: Map IT Systems to Critical Business Activities
Step 3: Model Threats Posed by Predictable and Plausible Events
Step 4: Develop Plans and Procedures for Preserving Business Continuity
Step 5: Develop Plans and Procedures for Recovering from Disaster
Step 6: Test Business Continuity Plans and Practice Disaster Recovery
6 Steps Away from Disaster
Chapter 1: Surviving the Worst
It’s 2:00 A.M. on a Monday morning, and your cell phone rings. The water fountain on the floor
directly over your server room has malfunctioned, and your organization’s servers and routers are
standing in water, as are most of your employees’ workstations. The office opens at 8:00 A.M. What
do you do in the meantime?
Situations like this one separate IT departments that have planned for disaster from those that
haven’t. For the latter group, the situation I’ve described is more than a disaster—it’s an absolute
disaster. When total data loss is possible, the absence of a disaster recovery program can put a
business at risk, particularly small-to-midsized businesses (SMBs), which often don’t have the financial
wherewithal to survive unexpected catastrophic events. Although disasters are inevitable and, to a
degree, unavoidable, being prepared for them is completely within your control. Increasingly, IT has
become the focal point of many companies’ disaster planning. Creating a program to preserve
business continuity and recover from disaster is one of the central value propositions that an IT
department can contribute to an organization.
A 6-Step Plan
In the terminology of disaster planning, two phrases are common: Business Continuity Planning (BCP)
and Disaster Recovery Planning (DRP). Although many people use these phrases interchangeably,
they represent different concepts. BCP traditionally defines planning that ensures an organization can
continue operating when faced with adverse events. DRP is actually a subset of BCP and traditionally
focuses on recovering information and systems in the event of a disaster. As an example, the failure
of a hard disk in a database server is an event that potentially affects business continuity but doesn’t
result from a disaster. However, a water-pipe break that floods a server room and submerges the
database server is a threat to business continuity and within the scope of disaster recovery planning.
BCP and DRP can be complex; in fact, large organizations dedicate groups of people to them.
But without getting into detailed risk analyses and other complexities that usually accompany BCP
and DRP in large companies, all organizations can benefit by following six steps to create a program
that will preserve business continuity and facilitate recovery in the event of disaster.
Step 1: Identify Critical Business Activities
Work with business decision makers to identify the activities that are essential to your organization's continued
functioning. Your organization’s BCP will center on preserving continuity of operations by recovering
these services.
Fault tolerance and failover. This countermeasure relies on the use of redundant hardware to
enable a system to operate when individual components fail. In IT, the most common fault tolerance
and failover solutions for preserving IT operations are hard disk arrays, clustering technologies, and
battery or generator power supplies.
Backup. On- and offsite backup programs are a core countermeasure in DRP. Backup gives you the
ability to restore or rebuild recent data to a known good state in the event of data loss.
Cold spares and sites. Cold spares are offline pieces of equipment that you can easily prepare to
take over operations. For example, you might maintain a set of servers that aren’t connected to your
network and that have your company’s standard OS installed and configured. In the event of an
emergency, you can complete the configuration and restore or copy necessary data to resume
operation. Similarly, a cold site is a separate facility that you can use to resume operation if a disaster
befalls your primary facility. Often, a cold site is nothing more than a large room that can
accommodate desks and chairs. For most SMBs, cold sites aren’t cost-effective.
Hot spares and sites. Hot spares are pieces of equipment that are ready for immediate use after a
disaster. For example, you might continuously replicate a critical database’s data to remote facilities so
that client applications can be redirected to the data replicas if necessary. Hot sites are facilities that
let you resume operations in a very short amount of time—typically, a hot site is
operational within the time it takes for employees to arrive at the facility. Hot sites have real-time or
near real-time replicas of data and are always operational. Because hot spares and sites are expensive
to maintain, only organizations that must be operational in a disaster, such as a public safety
organization, use them.
Chapter 2:
When you formulate a backup and recovery strategy for your Windows systems, you need to make
sure to include Group Policy Objects (GPOs) in that strategy. Microsoft provides a means to back up
and restore GPOs in the form of Group Policy Management Console (GPMC), a Microsoft
Management Console (MMC) snap-in that you can use to manage GPOs on Windows Server 2003
and Windows 2000 Server systems. In June, Microsoft released GPMC with Service Pack 1 (SP1)—an
updated version of GPMC—that lets you back up, restore, and copy or migrate GPOs on Windows
2003, Windows XP, and Win2K Server systems without requiring you to have Windows 2003
installed. (Previously, you needed a Windows 2003 license to use GPMC.) This change means that
you can use GPMC to back up, restore, and copy GPOs in domains with any combination of
Windows 2003, XP, and Win2K Server systems. You can download the free GPMC with SP1 at
http://tinyurl.com/ysx4u.
A Little History
In the early days of Active Directory (AD) and Group Policy, the only way to back up and recover
GPOs was to use the NTBackup utility, then perform an AD authoritative restore—a procedure not
for the faint of heart. One irritating characteristic of NTBackup is that it backs up the entire system
state as well as the GPOs themselves and thus requires a hefty chunk of free disk space to house
each instance of the system state.
Performing an authoritative restoration of a GPO that had been accidentally deleted, changed, or
corrupted was even more complicated. First, you had to take offline the domain controller (DC) on
which you ran NTBackup and reboot the DC in Directory Services Restore Mode. Then, you had to
restore the backup to prepare the server with the data you wanted to restore. Finally, you performed
an authoritative restore, which required you to know the complete distinguished name (DN) of the
deleted or modified GPO. Don’t confuse the DN with the GPO’s more familiar friendly name—for
example, “Security Settings for the Sales OU.” The DN is a complicated string that includes the GPO
path in DN format along with the GPO’s globally unique identifier (GUID)—for example,
cn={01710048-5F93-4F48-9DD2-A71C7486C431}, cn=policies,cn=system,DC=corp,DC=com, where the
GUID is the component preceding the first comma. If you didn’t know the GPO’s GUID before the
disaster, you had little hope of recovering it (and thus, little hope of restoring the GPO). At this point
in the GPO restoration process, people often just gave up. Third-party products made the GPO
backup-and-restore process bearable. Indeed, third-party tools are available today that include a GPO
backup-and-restore feature.
Figure 1
Backing Up All GPOs
If you examine the automatically generated subdirectories that the system creates during backup,
you’ll notice that the names of these directories resemble the GUIDs that I described earlier.
However, what isn’t immediately obvious is that these directory and GUID combinations don’t
correspond to the GUID of the underlying GPO and, in fact, are unique and unrelated to the GPO’s
GUID. This distinction lets you back up a GPO without fearing collision with an existing subdirectory.
You can store all the backups in the same subdirectory or in different ones.
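If you'd rather script GPO backups than click through the console, GPMC also installs a set of sample administration scripts, typically under %ProgramFiles%\GPMC\Scripts. As a minimal sketch (the backup folder and comment below are placeholders; check the script's /? output for the exact parameters on your system), you could back up every GPO in the domain with a command such as
cscript "%ProgramFiles%\GPMC\Scripts\BackupAllGPOs.wsf" C:\GPOBackups /comment:"Weekly GPO backup"
You could then schedule this command so that a current set of GPO backups is always available alongside your other backups.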
Restore GPOs
If a GPO is deleted, corrupted, or becomes otherwise invalid and you want to restore the backed-up
GPO, you can do so at any time by right-clicking the Group Policy Objects node and selecting
Manage Backups. In the Manage Backups dialog box, which Figure 2 shows, you choose which
GPOs you want to restore.
Figure 2
Restoring a Backed-up GPO
Select the location of the GPO backup you want to restore from the Backup location drop-down
list, or click Browse. If you’ve created multiple backups of one or more GPOs in the same directory,
simply select the Show only the latest version of each GPO check box to view the most recent set that
you backed up. Otherwise, all GPOs you wrote to this directory will be displayed along with the time
they were backed up. If you need a reminder about what settings were preserved within a GPO,
simply click the GPO, then click View Settings. Finally, when you’re ready, select the GPO and click
Restore. If you want to remove a GPO from a particular backup set, you can do so by clicking
Delete.
Figure 3
Import Settings Wizard
The Import Settings Wizard lets you back up the target GPO before you import the source GPO
to it. You need to back up the target GPO only when it isn’t newly created.
The steps I just walked through make up the basic procedure for copying or migrating a GPO by
using the Import command. However, if the GPO from which you want to import settings contains
Universal Naming Convention (UNC) paths or security groups, you’ll probably need to use the GPMC
migration table feature. For instance, the Group Policy Software Installation and Folder Redirection
features use UNC pathnames. To specify the software to distribute, the GPO
typically points to a Windows Installer (.msi) file at a UNC path—for example,
\\Server1\share. However, a server named Server1 might not exist in the target domain. Or, worse,
Server1 does exist, but you don't want your users to use that server. To ensure that the imported
GPO has the correct UNC and security group references, you need to process your
Copy or Import operation with a migration table.
The migration table lets you convert any UNC references from the source domain into valid
references in the target domain. The Import Settings Wizard automatically alerts you to UNC paths in
the source domain and gives you two options for handling UNC pathnames, as Figure 4 shows. The
first option, Copying them identically from the source, typically isn’t a wise choice because, as I
mentioned earlier, the UNC or security group references in the source domain might not exist in the
target domain; thus, the GPO probably won’t work after you copy it to the target domain. Therefore,
the better choice is to select the other option, Using this migration table to map them in the
destination GPO. To create your first migration table, at the Migrating References window that
Figure 4 shows, click New. You’ll see the Migration Table Editor, the spreadsheet-like dialog box that
Figure 5 shows. You can start by filling in the table with the information you know. Because you’re
importing from a backup, select Tools, Populate from Backup. Next, select the GPO that you’ll be
migrating. Doing so automatically populates the Source Name column with all the UNC references in
the GPO you’ve specified. Then, simply type a new UNC path (or security group reference) in the
Destination Name field for each UNC path (or security group) you need to migrate. In Figure 5, you
can see that the selected GPO includes the UNC pathname \\OLDServer\Software. However, in the
target domain, this server doesn’t exist. Therefore, you need to enter the appropriate pathname for
the GPO, such as \\NEWServer\OurStuff, to ensure that the GPO has the correct references in the
target domain.
Figure 4
Migrating References window
Figure 5
Migration Table Editor
After you’ve entered the new destination names, select Tools, Validate in the Migration Table
Editor to ensure that all the destination names are valid. After you’ve verified their validity, select File,
Save as to save the migration table you just created, then close it. The Migrating References window
is again displayed. Again, select the Using this migration table to map them in the destination GPO
option, and from the drop-down list select the migration table that you want to use. Typically, you’ll
select the migration table you just created, although you can select a previously created migration
table instead. Using an existing migration table comes in handy when you’re repeating the same
actions—for instance, when you want to transfer the same GPO from one domain to several other
domains. I also recommend that you select the Use migration table exclusively... check box, to ensure
that you always migrate GPOs with valid destination references.
Chapter 3:
With the release of Windows 2000, Microsoft overhauled the built-in backup utility NTBackup and
added media-management and scheduling capabilities. Although these updates are welcome, comments
I have read in Web support forums and newsgroups suggest that many administrators and
other users have been frustrated in their attempts to get NTBackup to work in their environments.
At my company, we previously used third-party backup software on Windows NT 4.0 because
NT 4.0’s NTBackup tool wasn’t robust enough for our needs. When we upgraded to Windows Server
2003, I took a second look at NTBackup to determine whether the overhauled version would be
robust enough and whether it could be easily automated. During this process, I discovered several
shortcomings in NTBackup:
• There's no simple way to write a backup to an arbitrary tape unless you use the /um (unmanaged)
option on the command line. This option lets you overwrite an arbitrary tape. However,
when NTBackup overwrites a tape in a backup job, it always applies a new media label—either
a label you specify or a generic label based on the current date and time. There’s no built-in way
to tell NTBackup to keep the current label but overwrite the tape.
• There’s no simple way to append backup information to an inserted tape because you must use
either the /t (tape name) or /g (globally unique identifier—GUID) options on the command line.
Unmanaged mode won’t work because you can only append to a specifically designated tape.
• NTBackup can’t eject a tape after a backup.
• NTBackup can’t email or print completed job logs.
To overcome these shortcomings, I created a set of scripts, which Table 1 lists. The set includes a
main script and 13 supporting scripts. The main script, Backup.cmd, uses NTBackup to perform a
backup that overwrites the current tape but doesn’t change the tape’s media label. The 13 supporting
scripts perform various functions for Backup.cmd. Together, these scripts provide a backup solution
that you can easily customize and use—no script-writing experience is necessary to run them. You
simply follow these steps:
1. Understand what each supporting script does so you can customize a solution.
2. Understand how the main script works so that you can customize it.
3. Prepare your environment and the scripts.
RSMView.cmd. NTBackup manages tapes and other media by using the Removable Storage
service, which has a database in the %SystemRoot%\system32\NtmsData directory. You can use a
command-line tool called RSM to manage the Removable Storage service. One of RSM’s most useful
commands is Rsm View, which lets you view the objects in the Removable Storage service’s database.
The object types of interest here are the LIBRARY, PHYSICAL_MEDIA, PARTITION, and
LOGICAL_MEDIA objects. (The object model for the Removable Storage database contains many
other types of objects.) A LIBRARY object represents a physical device that uses removable media.
The device we’re interested in is the tape drive. One type of object the LIBRARY object can contain is
the PHYSICAL_MEDIA object, which represents the physical medium—in this case, a tape. The
PHYSICAL_MEDIA object, in turn, can contain a PARTITION object, which represents a partition.
(The GUI uses the term side rather than partition, but the two terms refer to the same thing.) One
object the PARTITION object can contain is the LOGICAL_MEDIA object, which represents the
allocated partition of a tape.
When you use the Rsm View command to display the objects in a container object (i.e., an object
that contains other objects), you must specify the container object by its GUID. Determining the
GUID is simply a matter of using one of RSM’s options. When you use Rsm View with the
/guiddisplay option, RSM outputs a list of objects and their associated GUIDs. With the GUID in
hand, you can cut and paste the GUID into the Rsm View command. Cutting and pasting a couple
of GUIDs isn’t a problem, but having to cut and paste many of them can become quite tedious.
Scripting is the perfect solution to eliminate this tedious task. So, the RSMView.cmd script handles the
Rsm View command in the rest of the scripts. You don’t need to cut and paste a single GUID.
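For example, you could run the following two commands at a command prompt to see how the hierarchy fits together (the library GUID in the second command is a placeholder; paste in one from the first command's output):
rsm view /tlibrary /guiddisplay
rsm view /tphysical_media /cg<LibraryGUID> /guiddisplay
The first command lists the LIBRARY objects and their GUIDs; the second lists the PHYSICAL_MEDIA objects that the specified library contains.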
Library.cmd. Library.cmd obtains the tape drive’s friendly name and GUID or just the tape drive’s
GUID. If you run Library.cmd by itself, it will display the tape drive’s name and GUID. This
information is helpful for troubleshooting if you have problems with the scripts. When Backup.cmd
runs Library.cmd, Library.cmd retrieves only the GUID because Backup.cmd doesn’t need the tape
drive’s name.
PhysicalMedia.cmd. PhysicalMedia.cmd obtains the inserted tape’s friendly name and physical-
media GUID (i.e., the GUID for the physical tape) or just the tape’s physical-media GUID. If you run
PhysicalMedia.cmd by itself, it will display the tape’s name and GUID. When Backup.cmd runs
PhysicalMedia.cmd, PhysicalMedia.cmd retrieves only the tape’s GUID.
Partition.cmd. Partition.cmd obtains the partition’s friendly name and GUID or just the partition’s
GUID. If you run Partition.cmd by itself, it will display the partition’s name and GUID. When
Backup.cmd runs Partition.cmd, Partition.cmd returns only the partition’s GUID.
MediaGUID.cmd. MediaGUID.cmd outputs the inserted tape’s logical-media GUID (i.e., the GUID
for the allocated partition of the tape), which is used with NTBackup’s /g option. MediaGUID.cmd
outputs the GUID in the format that NTBackup needs.
MediaName.cmd. MediaName.cmd obtains the inserted tape’s media label as it appears in the Name
field on the Side tab of the tape’s Properties dialog box in the Removable Storage console. Figure 1
shows a sample Properties dialog box. Backup.cmd uses the media label obtained by MediaName.cmd
with NTBackup's /n option to reapply the existing media label when overwriting a tape.
Figure 1
Sample Properties dialog box
Note that NTBackup uses the Info field when referring to tapes by name. Windows 2003 and
Windows XP automatically set the Name field to match the Info field. However, when I was testing Win2K,
I discovered that the Name field was left blank, and I had to copy the label from the Info field to the Name
field. In any case, the Name field must match the Info field on the Side tab; otherwise,
MediaName.cmd will fail or return incorrect results.
Refresh.cmd. Depending on your hardware, RSM isn’t always able to detect when tapes are inserted
and removed. Thus, before running a backup job, it’s important to perform a refresh operation to
ensure that the database contains the tape drive’s actual state. Refresh.cmd refreshes the tape drive. If
the refresh operation is successful, Refresh.cmd uses sleep.exe to pause the script so that the database
will return up-to-date information.
Eject.cmd. Eject.cmd ejects the tape after the backup is complete. The script then calls Refresh.cmd
to refresh the tape drive to ensure the Removable Storage service’s database will return up-to-date
information.
ShowLog.cmd. ShowLog.cmd outputs the most recent NTBackup log file (i.e., the log you’d see in
the NTBackup GUI) to the screen. If you run ShowLog.cmd with the /f option, it will output the log’s
full path and filename instead.
PrintLog.cmd. PrintLog.cmd uses Notepad to print the most recent NTBackup log file. You need to
configure a default printer before PrintLog.cmd will work. Summary logs (/L:s) are recommended if you
use PrintLog.cmd.
MailLog.cmd. MailLog.cmd uses blat.exe to email the most recent NTBackup log file to the specified
person (e.g., the administrator). Blat.exe is a free utility that lets you send mail messages from a
command script.
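To give you a sense of what such a call looks like, here's a minimal sketch of a blat command (the log path, addresses, subject, and server name are placeholders, and option spellings can vary between blat versions):
blat "C:\Logs\backup.log" -t [email protected] -f [email protected] -s "NTBackup log" -server smtp.example.com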
SetupVars.cmd. SetupVars.cmd defines the settings of key environment variables used in the other
scripts. This setup makes the scripts easily portable to different computers. You just need to modify
the settings in one script (i.e., in SetupVars.cmd) rather than 13 scripts. (In Step 3, I discuss the
settings you need to modify.)
TapePrep.cmd. When you have one or more new tapes you want to label and allocate to
NTBackup, you can use TapePrep.cmd. This script runs NTBackup in unmanaged mode so that
NTBackup can use whatever tape happens to be inserted. The script labels the tape with the text
you specify.
You run TapePrep.cmd at the command line and not in a script such as Backup.cmd because
you need to use the script only once per tape. The command to launch TapePrep.cmd is
TapePrep media_label
where media_label is the text you want to use as the label. If the text contains spaces, you must
enclose it in quotes.
Listing 1: Backup.cmd
@Echo Off
Setlocal EnableExtensions EnableDelayedExpansion
:: Set up the environment variables.
Call SetupVars
:: Callout A: Set up the variables for the current script.
Set BACKUP=C:\NTBackup
Set JOBNAME=NTBackup tools
Set SETDESC=NTBackup tools [%DTSTAMP%]
Set JOBINFO=NTBackup job '%JOBNAME%' on %COMPUTERNAME%
Set OPTIONS=/hc:on /l:s /m normal /r:yes /v:no
Set LOGFILE=%~dpn0.log
Set RC=0
:: Refresh the media library.
Call Refresh > "%LOGFILE%"
:: Make sure the tape is inserted.
Call PhysicalMedia > nul
If %ERRORLEVEL% NEQ 0 (Set RC=%ERR_PHYSICAL_MEDIA%
Call :DIE "%JOBINFO% aborted: Media not detected (!RC!)"
Goto :END)
:: Determine the tape's media label.
Set MEDIANAME=
For /f "delims=" %%n in ('MediaName /b') Do Set MEDIANAME=%%n
If Not Defined MEDIANAME (Set RC=%ERR_PARTITION%
Call :DIE "%JOBINFO% aborted: Unable to determine media name (!RC!)"
Goto :END)
Echo Media name: %MEDIANAME% >> "%LOGFILE%"
:: Determine the GUID for the current tape.
Set MEDIAGUID=
For /f %%g in ('MediaGUID /b') Do Set MEDIAGUID=%%g
If Not Defined MEDIAGUID (Set RC=%ERR_LOGICAL_MEDIA%
Call :DIE "%JOBINFO% aborted: Unable to determine media GUID (!RC!)"
Goto :END)
Echo Media GUID: %MEDIAGUID% >> "%LOGFILE%"
:: (Callouts B and C, which build and run the NTBackup command and handle the
:: post-backup actions, aren't reproduced in this excerpt.)
:: Callout D: Error-handling subroutine.
:DIE
Setlocal
Set ERRORMSG=%~1
Echo %ERRORMSG% > "%NTBACKUP_DATA%\error.log"
Echo %ERRORMSG% >> "%LOGFILE%"
If /i "%MAILLOG%"=="YES" call MailLog "%ERRORMSG%"
If /i "%PRINTLOG%"=="YES" call PrintLog
Endlocal & Goto :EOF
:END
Endlocal & Exit /b %RC%
After defining the variables, Backup.cmd uses Refresh.cmd to refresh the tape drive. The script
then performs three important operations. First, Backup.cmd uses PhysicalMedia.cmd to see whether
a tape is inserted in the tape drive. Then, Backup.cmd uses MediaName.cmd to determine the tape’s
media label. Finally, Backup.cmd uses MediaGUID.cmd to determine the tape’s logical-media GUID.
If all three operations succeed, Backup.cmd defines the COMMAND variable, which contains the
NTBackup command, as callout B in Listing 1 shows. The script records the NTBackup command in
the script’s log file named backup.log, then executes the command. After NTBackup completes the
backup, Backup.cmd writes NTBackup’s exit code to backup.log. Finally, Backup.cmd checks the
EJECT, MAILLOG, and PRINTLOG variables, as the code at callout C in Listing 1 shows. If any of
these variables are set to YES, Backup.cmd calls the appropriate scripts.
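Callout B isn't reproduced in the Listing 1 excerpt above, but based on the options the chapter describes (/j for JOBNAME, /d for SETDESC, /n for the media label, /g for the GUID, plus OPTIONS), the COMMAND definition is probably close to the following sketch; treat it as an approximation rather than the author's exact code:
:: A rough sketch of what callout B might contain (not the author's exact code).
Set COMMAND=ntbackup backup %BACKUP% /j "%JOBNAME%" /d "%SETDESC%" /n "%MEDIANAME%" /g "%MEDIAGUID%" %OPTIONS%
Echo %COMMAND% >> "%LOGFILE%"
%COMMAND%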
If any of the three operations fail (i.e., the script can’t detect the tape’s physical-media GUID,
media label, or logical-media GUID), Backup.cmd calls the :DIE subroutine, which callout D in Listing
1 shows. This code displays an error message, creates the error.log file in the NTBackup data
directory, and writes the same error message to error.log and backup.log. Depending on the values
set for the MAILLOG and PRINTLOG variables, the code might also email and/or print the most
recent NTBackup log file, which contains the error message.
Listing 2: SetupVars.cmd
:: Callout A: Define the computer-specific values. Set RSM_LIBRARY to the tape drive's
:: name. Set RSM_REFRESH to the number of seconds to wait for the refresh.
:: If you plan to use TapePrep.cmd, set RSM_POOL to the tape drive's media pool.
Set RSM_LIBRARY=
Set RSM_REFRESH=
Set RSM_POOL=
:: Callout B: Define the post-backup behavior. To enable a behavior, change NO to YES.
Set EJECT=NO
Set MAILLOG=NO
Set PRINTLOG=NO
:: Callout C: Define the email settings if you set MAILLOG to YES. Set SMTPSERVER
:: to the SMTP server's DNS name or IP address. Set SENDER to the sender's
:: email address. Set RECIPIENTS to the recipients' email addresses.
Set SMTPSERVER=
Set SENDER=
Set RECIPIENTS=
:: Creates a date and time stamp in the format: dow yyyy-mm-dd hh:mm.
Set DTSTAMP=%DATE:~0,3% %DATE:~10,4%-%DATE:~4,2%-%DATE:~7,2% %TIME:~0,5%
:: Do not modify: Error constants.
Set ERR_NOT_FOUND=2
Set ERR_LIBRARY=6
Set ERR_PHYSICAL_MEDIA=7
Set ERR_PARTITION=8
Set ERR_LOGICAL_MEDIA=9
:: Do not modify: Path to NTBackup's Data directory.
Set NTBACKUP_DATA=%USERPROFILE%\Local Settings\Application Data\Microsoft\Windows NT\NTBackup\Data
If you plan to use TapePrep.cmd, you must also define the RSM_POOL variable.
RSM_LIBRARY defines the friendly name for your tape drive. If you’re unsure of this name, you
can open a command-shell window and run the command
RSMView library
to obtain the list of libraries on your system. Then simply copy the friendly name (but not the GUID)
into SetupVars.cmd. You don’t need to enclose the library’s friendly name in quotes if it contains
spaces. To check whether you have specified the library name correctly, type the following command
at a command prompt
Library
You should see the library’s friendly name followed by its GUID. You must make sure that the
library’s friendly name doesn’t contain any of the following reserved shell characters:
( ) < > ^ & |
If your library’s friendly name contains any of these characters, you need to remove them from
the library’s name. To do so, first open the Removable Storage console by selecting Start, Run,
ntmsmgr.msc, OK. In the left pane, expand Libraries (in Windows 2003 and XP) or Physical Locations
(in Win2K). Right-click the tape drive in the left pane and choose Properties. Remove any of the
offending characters from the Name text box, and click OK.
You need to set RSM_REFRESH to the number of seconds that Refresh.cmd should pause after
performing a device refresh. This value depends on your hardware; set it to the number of seconds it
takes to insert and remove a tape. A good starting point is 30 seconds.
If you plan to use TapePrep.cmd, set RSM_POOL to the media pool that matches the type of
media used by your tape drive. For example, if your tape drive uses Travan tapes, specify the Travan
media pool:
Set RSM_POOL=Travan
If the value contains spaces (e.g., 4mm DDS), you don’t need to enclose it in quotes.
The code at callout B in Listing 2 defines the post-backup behavior. Setting the EJECT variable to
YES prompts Eject.cmd to eject the tape. Setting MAILLOG to YES causes MailLog.cmd to email the
NTBackup log file to the specified recipient. Setting PRINTLOG to YES prompts PrintLog.cmd to print
the NTBackup log file on the default printer.
If you set MAILLOG to YES, you need to set the variables in the code at callout C in Listing 2.
(If you set MAILLOG to NO, you don’t need to do anything with this code.) Set the SMTPSERVER
variable to the DNS name or IP address of an SMTP server on your network. (In this example, no
authentication is used. However, blat.exe supports SMTP authentication. For information about how
to incorporate authentication, see the blat.exe documentation.) Set SENDER to the email address you
want to appear in the message’s From field. Set RECIPIENTS to the email address of the person you
want to receive the message. You can specify more than one recipient. Simply separate the addresses
with a comma; no spaces are allowed.
You don't need to do anything with the remaining code, which defines the DTSTAMP variable,
the ERR constants, and the NTBACKUP_DATA variable. Backup.cmd uses DTSTAMP to create a date
and time stamp. The ERR constants define the exit codes for the scripts. And the NTBACKUP_DATA
variable points to NTBackup's data folder, which stores the backup selection (.bks) files
and log files.
Customizing Backup.cmd. Open Backup.cmd in Notepad or another text editor. In the code at
callout A in Listing 1, set the BACKUP variable to the directory or directories you want to back up.
Separate directory names with spaces. If a directory contains spaces, enclose it in quotes. Alternatively,
you can specify a .bks file. Enter the full pathname to the file, prefix it with the @ symbol, and
enclose it in quotes if it contains spaces. You can use the NTBACKUP_DATA variable as well.
For example, if your filename is Weekly full.bks, you'd set the BACKUP variable to
"@%NTBACKUP_DATA%\Weekly full.bks"
Set the JOBNAME variable, which is used with NTBackup’s /j option, to the backup job’s name.
This name will appear in NTBackup’s Backup Reports window. Set SETDESC, which is used with
NTBackup’s /d option, to the backup job’s description. This description will appear in the log file and
in the Description column on the Restore tab in NTBackup’s UI. The optional DTSTAMP variable that
appears at the end of the description will let you see at a glance when the backup job ran.
You can customize the JOBINFO variable, but it’s not required. This variable specifies the text
used in error messages and in the email subject line. Note that it references the JOBNAME and
COMPUTERNAME variables so you’ll know at a glance the job and computer that generated the
message.
You use the OPTIONS variable to define NTBackup’s command-line options. For example,
suppose you want NTBackup to append the backup to whatever tape happens to be inserted. You
simply add the /a option to the OPTIONS variable. (You also need to remove /n "%MEDIANAME%"
in the Set COMMAND= line because you can't use /n with /a.)
You can specify any of NTBackup’s command-line options, except /j, /d, /n, /g, /f, /t, /p, and
/um. As I mentioned previously, /j and /d are already defined in JOBNAME and SETDESC,
respectively. The script automatically determines the tape’s media label (/n) and GUID (/g). The
other options (/f, /t, /p, and /um) shouldn’t be used because they create conflicts.
The LOGFILE variable exists for troubleshooting purposes. Each script’s output is redirected to
this file, along with the tape’s name and logical-media GUID, the NTBackup command line, and that
command’s exit code. If you leave the LOGFILE variable set to %~dpn0.log, the file will be created in
the same directory as Backup.cmd and will have the filename Backup.log.
Ready, Test, Go
After you’ve completed these three steps, the backup scripts are ready to test in a nonproduction
environment. Even though I tested them with Windows 2003, XP, and Win2K, you should always test
any new scripts in a controlled lab environment before deploying them. After you’ve successfully
tested them, you’ll have a flexible backup solution to use.
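Once the scripts pass your tests, you'll probably want to run Backup.cmd on a schedule. One simple approach on Windows 2003 and XP is the built-in schtasks utility; the task name, path, and start time below are placeholders, so adjust them for your environment:
schtasks /create /tn "Nightly NTBackup" /tr "C:\Scripts\Backup.cmd" /sc daily /st 23:00:00 /ru System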
Chapter 4:
If your system must be highly synchronized but can tolerate a certain amount of latency as well as
some data loss when conflicts arise, SQL Server 2000’s transactional replication with queued updates
(TRQU) can be a useful way to replicate data to multiple, always-on database servers. In this chapter,
I examine the intricacies of backing up and restoring TRQU setups. The following steps, which
Figure 1 shows, summarize the TRQU process.
Figure 1
The TRQU process
For the transactions that follow a forward path from Publisher to Subscriber:
1. SQL Server applies transactions at the Publisher server to the Publisher database log.
2. SQL Server specially marks for replication the transactions that involve the replicated articles
(tables).
3. The Log Reader agent running at the Distributor server periodically reads the Publisher database
log, picks up the transactions marked for replication, then applies them to a table in the
Distributor database on the Distributor server.
4. The Distribution Agent then runs on the Distributor server (for a push subscription) and applies
those transactions to the Subscriber database on the Subscriber server.
For the transactions that follow a reverse path from Subscriber to Publisher:
1. SQL Server records external transactions made by end users or end-user applications at the
Subscriber database in the queue table in the Subscriber database on the Subscriber server.
2. The Queue Reader agent periodically runs on the Distributor server, picks up the transactions
from the queue table on the Subscriber server, checks for and reconciles data conflicts, then
applies the transactions to the Publisher database on the Publisher server.
In a basic transactional-replication scenario that uses SQL Server’s default settings, the Log Reader
agent reads a transaction that SQL Server has marked for replication in the Publisher database log.
Then, after moving the transaction to the Distributor database, the Log Reader agent unmarks the
transaction. Unmarking the transaction is a signal for SQL Server to remove the transaction from the
log after the log backup's next run. This backup process is consistent and highly effective until
disaster strikes. Then, the quirks inherent in TRQU implementations cause anomalies that require
creative problem-solving.
Problem. What happens when the Publisher database crashes between times A and B, and SQL
Server can’t back up the log? In this situation, SQL Server has copied the transactions to the
Distributor and perhaps to the Subscriber, but not to the Publisher database’s log backup. Therefore,
if you restore the Publisher to the last good log backup, the Publisher database doesn’t contain the
transactions, even though SQL Server has copied these transactions to the Distributor and perhaps to
the Subscriber-or will copy them to the Subscriber during the next Distribution agent run. This type
of failure compromises the system’s data consistency, and, depending on the nature of the production
data, serious problems can arise.
Solution. SQL Server 7.0 and earlier releases offer no easy way around this problem. DBAs who
work with these releases have to manage the problem more through access and timing control than
through an integrated SQL Server mechanism. The good news is that SQL Server 2000 provides a
feature for resolving this inconsistency easily. You merely set an option called sync with backup by
running the following T-SQL command in the master database on the Publisher server:
EXEC master..sp_replicationdboption @PubDBName, 'sync with backup', 'true'
This setting tells the Log Reader agent to read only those marked transactions in the log that SQL
Server has backed up. After the log backup is complete, SQL Server updates the log record by noting
that the transactions have been backed up. Then, during its next run, the Log Reader agent reads the
transactions and unmarks the transactions that were marked to denote they needed replication. The
next log-truncation process removes those transactions from the Publisher database’s active log. This
small, clever feature maintains data consistency in the Publisher backup.
Problem. What happens when the Distributor database crashes before the Distribution agent can run
and SQL Server hasn't backed up those transactions on the Distributor? Data inconsistency again
exists: the Publisher has copies of the transactions, but the Distribution agent hasn't copied these
transactions to the Subscriber. Restoring the Distribution database to the last good log backup won’t
retrieve those transactions because SQL Server never backed them up.
Solution. Using sync with backup again is the solution. This time, however, you need to set the
option for the database on the Distributor server. When you set this option, SQL Server can’t delete
the transactions from the Publisher database log until it backs them up at the Distributor server as
part of the Distributor database’s log backup.
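For example, you could run a command such as the following from a command prompt on the Distributor (the server name is a placeholder, and 'distribution' is the default name of the distribution database; run the statement in the master database on the Distributor):
osql -E -S DistServer -d master -Q "EXEC sp_replicationdboption 'distribution', 'sync with backup', 'true'"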
At this stage, the Log Reader agent has read the transactions, the Publisher log backup has
backed up the transactions on the Publisher server, but SQL Server hasn’t backed up the Distributor
database log. Under these circumstances, SQL Server doesn’t delete the transactions in the active
Publisher database log until it has backed up the transactions on the Distributor. After the Distributor
database’s log backup is complete, SQL Server marks the transactions in the Publisher’s active log.
Then, the transactions are deleted from the Publisher database’s active log during the next truncation
process. Therefore, if the Distribution database crashes before SQL Server has backed up the
transactions on the Distributor, data-consistency problems don’t occur. After the Distributor database
restore, the Log Reader agent picks up the transactions from the active Publisher log based on the
transactionID.
Problem. Sometimes in a TRQU project, updates occur at the Subscriber as well as at the Publisher.
What do you do when the Subscriber database crashes?
Solution. As long as the replicated transactions are stored in the Distributor database, you haven’t
lost anything. You can use the @min_distretention option of the sp_adddistributiondb stored
procedure, which sets the minimum transaction retention period for the Distributor database. The
distribution-cleanup process doesn't delete transactions from the msrepl_commands
table in the Distributor database that fall within that minimum retention period. So, if
the Subscriber database crashes, you can restore it to the last good backup. And, assuming that your
minimum transaction-retention period is more than the time elapsed since the last good backup of
the Subscriber database, the replicated transactions are safe. The Distribution agent determines the
state of the Subscriber database and starts reapplying transactions from the point to which the
database has been restored. Therefore, you should consider using options such as sync with backup
and minimum history retention when you estimate the log sizes and size of the database (e.g., for the
msrepl_commands table at the Distributor, the Queue table at the Subscriber, or conflict tables at the
Publisher and the Subscriber).
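As a rough illustration only (the server name and retention values are assumptions; the retention parameters are specified in hours and are normally supplied when you create the distribution database):
osql -E -S DistServer -d master -Q "EXEC sp_adddistributiondb @database = 'distribution', @min_distretention = 24, @max_distretention = 72"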
Queue Complications
The sync with backup solutions to the Publisher and Distributor database crashes work when the
updates take place only at the Publisher. However, as I noted earlier, TRQU updates can take place
simultaneously in more than one database. What happens in a TRQU scenario in which updates are
applied simultaneously at the Subscriber and at the Publisher when the sync with backup option is
on? Visualize the following scenario: An update takes place at the Subscriber and this transaction is
recorded in the queue table on the Subscriber. The Queue Reader agent reads this transaction and
applies it to the Publisher (at which stage SQL Server makes a log entry in the Publisher log). The
Queue Reader agent then deletes the transaction from the queue table on the Subscriber, but before
you can back up the Publisher log, the Publisher database crashes, and you can’t back up the
transaction log.
Problem. Unlike with the forward path (from Publisher to Subscriber), in which the transactions are
preserved in the distribution database, no transactions are preserved in the reverse direction. This loss
of data occurs because the Queue Reader deletes the transactions from the queue table after applying
them to the Publisher.
Solution. Typically, if you want to minimize the loss of transactions in such a disaster scenario, you
undo replication, get the failed SQL Server database up and running, then set up replication again
with the roles reversed. The original Subscriber then becomes the Publisher, and the original
Publisher becomes the new Subscriber.
However, even when the Publisher database has crashed, you can salvage your data if you can
back up the Publisher log. To accomplish this backup, you need to set up the TRQU to allow the
Log Reader to read transactions from the available Publisher database log (after the log backup), stop
all the replication agents, then restore the Publisher database to the last backed-up log. After you
restart all the agents, you can resume replication.
Chapter 5:
As a DBA, one of your many tasks is to manage your SQL Server databases’ ever-expanding storage
requirements. How often do you find yourself adding more disk, trying to accurately size a database,
or wishing you could more efficiently use your existing disk capacity? Storing database data on a SAN
can make such tasks much easier and can also improve disk performance and availability and shorten
backup and restore times. Start your search for a SAN here, as you learn the basics of SAN
technology and the benefits of using a SAN to store SQL Server databases. And the sidebar “Selecting
a Storage Array for a SAN” covers several features you’ll want to consider when selecting a storage
array for your SAN.
SAN Fundamentals
A SAN is basically a network of switches that connect servers with storage arrays. SAN topology is
similar to how Ethernet switches are interconnected, as Figure 1 shows. A SAN’s physical layer
comprises a network of either Fibre Channel or Ethernet switches. Fibre Channel switches connect to
host bus adapter (HBA) cards in the server and storage array. Ethernet switches connect to Ethernet
NICs in the servers and storage array.
Figure 1
SAN Topology
A storage array is an external disk subsystem that provides external storage for one or more
servers. Storage arrays are available in a range of prices and capabilities. On the low end, an array
consists simply of a group of disks in an enclosure connected by either a physical SCSI cable or Fibre
Channel Arbitrated Loop (FC-AL). This type of plain-vanilla array is also commonly called Just a
Bunch of Disks (JBOD). In high-end arrays, storage vendors provide features such as improved
availability and performance, data snapshots, data mirroring within the storage array and across
storage arrays, and the ability to allocate storage to a server outside the physical disk boundaries that
support the storage.
Two types of SANs exist: Fibre Channel and iSCSI. Fibre Channel SANs require an HBA in the
server to connect it to the Fibre Channel switch. The HBA is analogous to a SCSI adapter, which lets
the server connect to a chain of disks externally and lets the server access those disks via the SCSI
protocol. The HBA lets a server access a single SCSI chain of disks as well as any disk on any storage
array connected to the SAN via SCSI.
iSCSI SANs use Ethernet switches and adapters to communicate between servers and storage
arrays via the iSCSI protocol on a TCP/IP network. Typically, you’d use a Gigabit Ethernet switch and
adapter, although 10Gb Ethernet switches and adapters are becoming more popular in Windows
server environments.
On a SAN, a server is a storage client to a storage array, aka the storage server. The server that
acts as the primary consumer of disk space is called the initiator, and the storage server, which
provides the disk space, is called the target.
The disks that the storage arrays provide on the SAN are called LUNs and appear to a Windows
server on the network as local hard drives. Storage-array vendors use a variety of methods to make
multiple hard drives appear local to the storage array and to represent a LUN to a Windows server by
using parts of multiple hard drives. Vendors also use different RAID schemes to improve performance
and availability for data on the LUN. Whether the SAN uses Fibre Channel or Ethernet switches,
ultimately what appears from the Windows server through the Microsoft Management Console (MMC)
Disk Management snap-in are direct-attached disks, no different from those physically located within
the server itself. In addition, most arrays have some type of RAID protection, so that the storage that
represents a given LUN is distributed across multiple hard drives that are internal to the storage array.
SAN Security
SAN architecture provides two measures for securing access to LUNs on a SAN. The first is a
switch-based security measure, called a zone. A zone, which is analogous to a Virtual LAN (VLAN),
restricts access by granting only a limited number of ports on several hosts an access path to several,
but not all, storage arrays on the SAN.
The second security measure is storage-array-based; a storage array can use LUN masking to
restrict access. Depending on the vendor, this security feature comes free of charge with the storage
array or is priced separately as a licensed product. LUN masking can be configured either by the
administrator or by the storage-array vendor for a fee. When masking is configured, the array grants
only explicitly named ports of named hosts an access path to the specified LUNs. LUN masking
functions similarly to ACLs on Common Internet File System (CIFS) shares in a Windows environment.
SAN Benefits
Now that you have a grasp of what a SAN is, you’re probably wondering how a SAN could benefit
your SQL Server environment. To address this question, we’ll first examine problems inherent in local
DAS, then explore how using a SAN avoids these problems.
Performance and availability. As part of the typical process of designing a database that will
reside on a local disk, or DAS, you’d determine how the disks on which the database will be stored
are attached (i.e., which disks are attached to which SCSI adapter). You want to carefully organize the
database files to minimize contention for disk access—for example, between a table and indexes on
the table, two tables that are frequently joined together, or data and log files. To minimize contention
(i.e., disk I/O operations), you'd try to ensure that the two contending objects are separated not only
on different disks but also across SCSI adapters.
Another disk-related issue that you must consider in designing a database is availability. You need
to use some type of disk redundancy to guard against disk failures. Typically, you'd use either RAID 1
(mirroring) or RAID 5 to provide redundancy and thus improved availability.
After you create the RAID devices by using Windows’ Disk Management, you might lay out the
database across these multiple RAID storage structures. When allocating such structures, you have to
decide how to size them. Determining the amount of storage each server needs is like estimating
your taxes: whether you overestimate or underestimate, you'll be penalized either
way. If you overestimate your storage needs and buy too much, you'll have overspent on storage. If you
underestimate them, you'll soon be scrambling for ways to alleviate the shortage.
A SAN addresses the issues of contention, availability, and capacity. On a SAN, the storage array
typically pools together multiple disks and creates LUNs that reside across all disks in the pool.
Different disks in the pool can come from different adapters on the storage array, so that traffic to
and from the pool is automatically distributed. Because the storage array spreads the LUNs across
multiple disks and adapters, the Windows server that's attached to the SAN sees only a single disk in
Disk Management. You can use just that one disk and not have to worry about performance and
availability related to the disk, assuming that your storage or network administrator has properly
configured the SAN.
How complex or simple a storage array is to configure depends on the vendor’s implementation.
I recommend that you meet with the IT person responsible for configuring your storage and ask
him or her to explain your storage array’s structure. Also, determine your storage requirements ahead
of time and give them to this person. In addition to storage size, note your requirements for
performance (e.g., peak throughput—40Mbps); availability (e.g., 99.999 percent availability); backup
and recovery (e.g., hourly snapshot backups take 1 minute; restores take 10 minutes); and disaster
recovery, based on metrics for recovery time objective (RTO)—the time it takes to restore your
database to an operational state after a disaster has occurred—and recovery point objective (RPO)—
how recent the data is that’s used for a restore. Using these metrics to define your requirements will
help your storage administrator better understand your database-storage needs.
Some vendors' storage arrays let you dynamically expand a LUN that you created within the disk
pool without incurring any downtime to the SQL Server database whose files reside on that LUN. This
feature lets DBAs estimate their disk-space requirements more conservatively and add storage capacity
without downtime.
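After the storage administrator grows the LUN, you still need to extend the Windows volume that sits on it. One way (a sketch; the volume number is a placeholder, and this applies to basic NTFS data volumes, not the system volume) is to run diskpart with a small script file:
diskpart /s extendlun.txt
where extendlun.txt contains the two lines
select volume 3
extend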
Backup control. As a database grows, so does the amount of time needed to perform database
backups. In turn, a longer backup requires a longer backup window. Partial backups—such as
database log backups—take less time but require more time to restore. Increasingly, upper
management is mandating smaller backup windows and shorter restore times for essential
applications, many of which access SQL Server databases. SANs can help decrease backup windows
and restore times. Some storage arrays can continuously capture database snapshots (i.e., point-in-
time copies of data), which are faster to back up and restore than traditional database-backup
methods. The snapshot doesn’t contain any actual copied data; instead, it contains duplicate pointers
to the original data as it existed at the moment the snapshot was created.
To back up SQL Server database data by using snapshots, you’d typically want to put your
database in a “ready” state, more commonly called a hot-backup state, for a few moments to perform
the snapshot. If you didn’t put your database in a hot-backup state, the snapshot could take a
point-in-time copy of your database before SQL Server has finished making a consistent database
write. Storage-array vendors often use Microsoft's SQL Server Virtual Backup Device Interface (VDI)
API to enable their software to put the database in a hot-backup state. This lets the system copy the
point-in-time snapshot image to separate backup media without causing a database outage.
Snapshots are minimally intrusive, so you can use them frequently without affecting database
performance. Restoring data from a snapshot takes only a few seconds. By using a SAN-connected
storage array along with a snapshot capability, DBAs can minimize backup windows and restore
times, in part because snapshot images are maintained on distributed disks in the array, instead of on
one local disk.
Reduced risks for database updates. Changes to a database, such as SQL Server or application
upgrades or patches, can be risky, especially if the changes might cause database outages or worse,
database corruption. To test changes without putting the production database at risk, you’d need to
set aside an amount of storage equivalent to the size of the production database. On this free storage,
you'd restore the most recent backup of that database (typically 1 week old). You'd spend a few hours
(maybe even days) restoring the database from tape to disk, applying the changes, then testing to see
whether the changes were successfully applied and whether they adversely affected the database.
After you verified that the changes were successfully implemented, you’d apply them to the
production database.
Some vendors’ SAN storage arrays let you quickly clone your database data for testing purposes.
Cloning the data takes only a few seconds versus hours to restore it from tape. The added benefit of
cloning is reduced disk utilization. Some cloning technology lets you take a read-only database
snapshot and turn it into a writeable clone. For testing purposes, the clone consumes far less disk
storage than a full backup of a database because only modified blocks of data are copied to the
clone database.
SAN switches connected to storage arrays. If the SAN wasn’t properly designed and configured (i.e.,
to provide redundancy), the storage array or a component on the SAN could fail, so that servers on
the SAN couldn’t access data on the storage array.
To enable you to troubleshoot storage problems, you'll need to make sure that SQL Server binaries
and message-log files stay on the local disk. Storing the message log and binaries on a disk other
than the local disk puts the database in a Catch-22 situation, in which a database-access failure
caused by a storage-connection failure can’t be logged because logging occurs only for the device on
which the logs and binaries are stored.
The second methodology is to copy the snapshot block to a free block, then overwrite the block
that was just copied. This approach, which Figure B shows, is often called copy-on-write.
Copy-on-write requires more data movement and overhead on the storage array's part than the first approach.
In Figure B, block D is moved from the current block to a new block so that the new contents of D
can be written to D's old location. Doing so requires three block I/Os and an update to a link,
whereas the first approach requires only one block I/O. This difference becomes significant for disk
performance as large numbers of blocks are updated.
Figure B
Second snapshot-updating methodology
Support for Fibre Channel and iSCSI on the same array. Consider buying a storage array that
supports both Fibre Channel and iSCSI, so that you have the flexibility to switch from one to the
other or implement both. (For example, you might want to use an iSCSI SAN for testing and
development and use a Fibre Channel SAN for production.)
Ability to create, grow, and delete LUNs dynamically. Being able to create, grow, and delete
LUNs without bringing a database down is a major benefit of putting the database on a SAN. If you
need this capability, consider storage arrays that provide it.
Integration of snapshot backups with SQL Server. The process of taking a snapshot copy of your
SQL Server database needs to be coordinated with your database and NTFS. Storage-array vendors
can use Microsoft’s SQL Server Virtual Backup Device Interface (VDI) API to accomplish this
coordination. If the snapshot process isn’t synchronized with NTFS and the database, the created
snapshot might not be in a consistent state because either NTFS or the database might not have
completely flushed pending writes from memory to the LUN.
A uniform storage OS as you scale up. You’d most likely want to start with a small storage array
to test and validate the SAN’s benefits before deploying it enterprise-wide. Look for a storage array
that lets you grow without having to do a “forklift” upgrade or having to learn a new storage OS.
Maintaining a consistent OS lets you upgrade your storage array as your needs grow, with a
minimum of database downtime.
A transport mechanism to mirror data over the WAN to a recovery site. The storage array
should provide a uniform transport method for sending mirrored data across the WAN to another
storage array for disaster recovery purposes.
Ability to instantaneously create a writeable copy of your database. Look for storage arrays that
let you instantaneously create a writeable copy (i.e., clone) of your database for testing upgrades and
large data loads without affecting the production database. This feature could reduce outages and
corruption of the production database, giving DBAs a tool to test major changes without endangering
data.
Step Up to a SAN
As you can see, housing databases on a SAN can benefit DBAs in various ways. SANs can reduce the
pain of sizing storage requirements for databases, enhance overall storage throughput, simplify
storage performance tuning, and improve availability. Using a SAN can also shrink backup and
restore windows, enable quicker and easier testing cycles, and reduce the storage overhead of test
environments.
The availability of iSCSI removes the cost barriers that have until now inhibited some users from
investigating SANs. Now’s the time to check out SAN technology and see whether it can improve
your database-storage environment.
Chapter 6:
I’m often asked about using a redundant array of inexpensive disks (RAID) for fault tolerance. Should
SQL Server be installed on a RAID device? The answer is yes, if you can afford it. RAID is the easiest
and best way to implement fault tolerance, and you definitely want your production databases
housed in a fault-tolerant environment. RAID server systems are more expensive than ordinary
single-disk servers because of the additional hardware and software you use to implement the RAID
configurations. Several types of RAID are commonly available, and each has its own specific use.
RAID 0 uses striping, a method that creates a disk partition that spans multiple hard drives to
take advantage of the many operational read/write heads associated with multiple spindles (similar in
concept to Windows NT striping). RAID 0 is the fastest type of RAID, but unlike most RAID
implementations, it doesn’t provide fault tolerance. If one of the drives in a RAID 0 configuration fails,
all the data is lost. Don’t use RAID 0 if you value your data.
RAID 5 is the most common way to implement fault tolerance. RAID 5's read data transaction
rate is high, and its write data transaction rate is also good compared with other configurations. RAID 5
also offers a good aggregate transfer rate. A typical RAID 5 configuration contains three or more hard
drives. RAID 5 divides up the data and writes it in chunks spread across all the disks in the array.
Redundancy is provided by parity information, which the RAID controller calculates from the data
and writes across all the disks in the array, interleaved with the data. The parity information enables
the RAID controller to reconstruct the data if one of the disks fails or the data stored on it is
corrupted. RAID 5 is the most cost-effective way to implement fault tolerance. Store SQL Server
system and user data files on a RAID 5 device.
RAID 1 uses mirroring, a method in which each drive has a mirror copy on another drive.
Mirroring is the most fault-tolerant RAID scheme, but it’s also the most expensive because of the
additional hardware and software you need to support mirroring. SQL Server stores data sequentially
on the transaction logs and in TempDB, which makes these essential parts of your database well
suited for RAID 1 protection. Put the transaction log and TempDB on a RAID 1 device at least, even
if you can’t afford RAID for any other parts of your database.
RAID 10 is a combination of RAID 1 and RAID 0 that uses both mirroring and striping: data is
striped across two or more mirrored pairs of drives. It's expensive, but it's fast and provides the best
combination of redundancy and performance. If you can afford it, put the transaction log and
TempDB on a RAID 10 device rather than a RAID 1 device for the extra performance and protection
it offers. For a good comparison of RAID types, see Advanced Computer and Network Corporation's
RAID tutorial at http://www.acnc.com/raid.html.
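To make the file-placement advice concrete, here's a minimal T-SQL sketch that creates a database
with its data file on a volume assumed to be RAID 5 and its log file on a volume assumed to be
RAID 1 or RAID 10. The database name, drive letters, paths, and sizes are hypothetical.

-- Hypothetical layout: E: is a RAID 5 volume for data, L: is a RAID 1 or RAID 10
-- volume for the transaction log.
CREATE DATABASE Sales
ON PRIMARY
    (NAME = Sales_Data,
     FILENAME = 'E:\SQLData\Sales_Data.mdf',
     SIZE = 500MB,
     FILEGROWTH = 100MB)
LOG ON
    (NAME = Sales_Log,
     FILENAME = 'L:\SQLLogs\Sales_Log.ldf',
     SIZE = 100MB,
     FILEGROWTH = 50MB)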
A typical approach to high availability in Analysis Services 2000 is to use Windows Network Load
Balancing (NLB) to distribute user queries across multiple Analysis Services instances on disparate
machines while also increasing availability. You keep the databases on these machines in sync with
file-based backup and restore (required for large databases due to the 2GB .cab file size limitation)
from a secondary server on which cube and dimension processing is performed. For more
information, read the Microsoft white paper “Creating Large-Scale, Highly Available OLAP Sites” at
http://www.microsoft.com/sql/evaluation/bi/creatingolapsites.asp.
Even though Analysis Services 2000 isn’t cluster-aware, you can cluster an Analysis Services 2000
database. SQL Server 2005 Analysis Services is cluster-aware and fully supports active-active clustering,
which means you can create a failover cluster to ensure high availability. In addition, Analysis
Services 2005 has a server-synchronization feature, which lets you incrementally synchronize metadata
and data changes between a source database and a destination (production) database while users
continue to query the destination database. Unlike in Analysis Services 2000, in Analysis Services 2005
you can't simply copy the data files from the data folder of an instance on one machine to the data
folder on another machine. These files are encrypted by default (using the machine name as part of
the key) unless you modify the .config file to change the RequiredProtectionLevel setting from 1 to 0.
Finally, in Analysis Services 2005, you can back up any size database (the 2GB .cab file limit has been
removed), which means you can use backup and restore to move a database of any size between
servers.
Of all a DBA’s missions, none is more important than ensuring that vital business services are
available to end users. All of your high-end scalability hardware and modern .NET coding techniques
will make little difference if users can’t access data. Unplanned downtime for an application or the
database server can cost an organization dearly in money and reputation. Outages for large online
retailers or financial institutions can cost millions of dollars per hour, and when users can’t access a
site or its vital applications, the organization loses face and customer goodwill.
Microsoft and other enterprise database vendors have devised several high-availability
technologies. For example, Microsoft Clustering Services lets one or more cluster nodes assume the
work of any failed nodes. Log shipping and replication help organizations protect against both server
and site failure by duplicating a database on a remote server. And traditional backup-and-restore
technology protects against server and site failure as well as application-data corruption by
periodically saving a database’s data and log files so you can rebuild the database to a specified date
and time. Although these technologies can help you create a highly available environment, by
themselves they can go only so far. Technology alone can’t address two critical pieces of the complex
high-availability puzzle: the people and processes that touch your system.
Server and site failure can produce downtime, but they’re relatively rare compared to human
error. The mean time between failures (MTBF) for servers is high, and today’s hardware, although
not perfect, is usually reliable, making server failures uncommon. In contrast, users, operators,
programmers, and administrators interact with your systems virtually all the time, and the high volume
gives more chances for problems to arise. Thus, the ability to quickly and efficiently recover from
human errors is essential for a highly available system. An operator error can take down a database
or server in a few seconds, but recovery could take hours. However, with proper planning, you can
reduce downtime due to human error by creating adequate application documentation and by
ensuring that personnel receive proper training.
Processes are also critical for a highly available environment. Standardized operating procedures
can help reduce unnecessary downtime and enable quicker recovery from planned and unplanned
downtime. You need written procedures for performing routine operational tasks as well as
documentation that covers the steps necessary to recover from various types of disasters. In addition,
the DBA and operations staff should practice these recovery plans to verify their accuracy and
effectiveness. Another process-related factor that can contribute to high availability is standardizing
hardware and software configurations. Standardized hardware components simplify implementing
system repairs and acquiring replacement components after a hardware failure. Standardized software
configurations make routine operations simpler, reducing the possibility of operator error.
Creating a highly available environment requires more than just technology. Technology provides
the foundation, but true high availability combines platform capabilities, effective operating
procedures, and appropriate training for everyone involved with the system.
Cluster Service
Microsoft Cluster service provides a high degree of database protection as well as automatic failover
by letting you set up two or more servers in a cluster. If one server fails, its workload is automatically
transferred to one of the remaining servers in the cluster. SQL Server 2000 Enterprise Edition supports
Cluster service, but it can be expensive to implement because it requires multiple servers that must
come from the Microsoft Hardware Compatibility List (HCL).
Log Shipping
Log shipping protects against server and database failure by creating a backup of the original
database on the primary server, then restoring the backup to the standby server. The standby server
is in a state of continuous recovery, so transaction logs captured on the primary server are
automatically and periodically forwarded to and applied on the standby server. If you use TCP/IP as
the data transport, you can operate the primary and standby servers in different locations. You have
to initiate the failover process manually. Log shipping is included in SQL Server 2000 Enterprise
Edition, but it's a less expensive option than Cluster service because the servers don't have to come
from the HCL and you can implement log shipping manually on any server that runs SQL Server.
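For illustration, here's a minimal sketch of the T-SQL that a manual log-shipping job schedules; the
database name, share, and file paths are hypothetical.

-- On the primary server (run on a schedule, e.g., every 15 minutes):
BACKUP LOG Sales TO DISK = '\\StandbyServer\LogShip\Sales_log.trn' WITH INIT

-- On the standby server (seeded beforehand with a full backup restored WITH STANDBY):
RESTORE LOG Sales FROM DISK = '\\StandbyServer\LogShip\Sales_log.trn'
WITH STANDBY = 'D:\LogShip\Sales_undo.dat'  -- keeps the standby readable between restores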
Replication
Transactional replication is typically a feature you use to distribute data, but it also functions as a
high-availability solution that protects against server and site failure by duplicating data between
geographically separated servers. When you use replication, both your primary server and your
backup servers can actively provide database services. Switching from the primary server to the
backup server containing the replicated data is a manual process. All SQL Server editions support
transactional replication.
Database Mirroring
SQL Server 2005 will introduce database mirroring, which uses a primary server, a mirrored server,
and a witness server (that monitors the database mirror’s status) with Transparent Client Redirection
(in Microsoft Data Access Components—MDAC) to provide a database-level high-availability solution.
Database mirroring, basically built-in real-time log shipping, begins by restoring a database backup on
the mirrored server, then forwarding transaction logs in realtime from the primary server to the
mirrored server. The failover process is automatic; if the MDAC layer on the client fails to connect to
the primary server, it can automatically connect to the mirrored server.
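Based on the pre-release SQL Server 2005 documentation, the setup looks roughly like the following
sketch; the server names, port, and database name are hypothetical, and the syntax could change
before the final release.

-- On each partner (and the witness), create a database mirroring endpoint:
CREATE ENDPOINT Mirroring
    STATE = STARTED
    AS TCP (LISTENER_PORT = 5022)
    FOR DATABASE_MIRRORING (ROLE = ALL)

-- On the mirror server (after restoring the database backup WITH NORECOVERY):
ALTER DATABASE Sales SET PARTNER = 'TCP://principal.example.com:5022'

-- On the principal server:
ALTER DATABASE Sales SET PARTNER = 'TCP://mirror.example.com:5022'
ALTER DATABASE Sales SET WITNESS = 'TCP://witness.example.com:5022'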
Backing up your transaction log lets you maintain a record of all the changes to a SQL Server
database so that you can restore it later if you need to. The following list will help you remember the
key features of log backups so that you can use them to your best advantage.
Use the full or bulk-logged recovery model. If your database is in the simple recovery model, you
can’t make log backups because SQL Server will truncate your log periodically.
Store your transaction log on a mirrored drive. Even if your data files are damaged and the
database is unusable, you can back up the transaction log if the log files and the primary data file are
available. Use a RAID level that guarantees redundancy, such as 1 or 10, and you’ll be able to back
up all the transactions to the point of failure, then restore them to the newly restored database.
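For example, if the data drive is lost but the mirrored log drive survives, a tail-of-the-log backup
along these lines (names are hypothetical) captures the final transactions:

-- Back up the tail of the log even though the database is damaged:
BACKUP LOG Sales TO DISK = 'E:\Backups\Sales_tail.trn' WITH NO_TRUNCATE
-- Restore the most recent full backup and earlier log backups WITH NORECOVERY,
-- then apply the tail backup and recover:
RESTORE LOG Sales FROM DISK = 'E:\Backups\Sales_tail.trn' WITH RECOVERY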
Monitor log size. Although carefully planning how large your log should be is vital, don’t assume it
will never grow bigger than it did during testing. Use SQL Agent Alerts to watch the Performance
Monitor counters that track file size, and when the log crosses a threshold that you define, SQL Server
Agent can take predetermined actions such as running a script to increase the log size, sending you
email, or shrinking the file.
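As a sketch (the database name, threshold, and operator are hypothetical, and the operator must
already exist), you can check log usage on demand and define a SQL Server Agent alert on the
log-size counter:

-- Show current log size and percentage used for every database:
DBCC SQLPERF(LOGSPACE)

-- Raise an alert when the Sales log file grows beyond roughly 500MB:
EXEC msdb.dbo.sp_add_alert
    @name = N'Sales log over 500MB',
    @performance_condition = N'SQLServer:Databases|Log File(s) Size (KB)|Sales|>|512000'
EXEC msdb.dbo.sp_add_notification
    @alert_name = N'Sales log over 500MB',
    @operator_name = N'DBA Team',
    @notification_method = 1  -- 1 = email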
Remember that log backups are non-overlapping. In SQL Server 2000 and 7.0, each log backup
contains all transactions since the previous log backup, so a long-running transaction can span
multiple log backups. So when you're restoring log backups, don't use the WITH RECOVERY option
until you've applied the last log—later log backups might contain the continuation of the transactions
in the current log backup.
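For example, a typical restore sequence (file names here are hypothetical) applies every log backup
WITH NORECOVERY and recovers only after the last one:

RESTORE DATABASE Sales FROM DISK = 'E:\Backups\Sales_full.bak' WITH NORECOVERY
RESTORE LOG Sales FROM DISK = 'E:\Backups\Sales_log1.trn' WITH NORECOVERY
RESTORE LOG Sales FROM DISK = 'E:\Backups\Sales_log2.trn' WITH NORECOVERY
-- Recover only after the last log backup has been applied:
RESTORE LOG Sales FROM DISK = 'E:\Backups\Sales_log3.trn' WITH RECOVERY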
Understand the difference between truncating and shrinking. Backing up the log performs a
truncate operation, which makes parts of the log available for overwriting with new log records. This
doesn’t affect the physical size of the log file—only shrinking can do that.
Carefully plan how often to make log backups. There’s no one-size-fits-all answer, and you’ll
always have trade-offs. The more often you make log backups, the more backups you’ll have to
manage and restore, but the less likely you’ll be to lose transactions in case of a system failure.
The log size doesn’t always reflect the log-backup size. If your database is using the bulk-logged
recovery model, the log backups will include all data that the bulk operations affected, so the
backups can be many times as large as the log file.
Maintain log backups for at least two previous database backups. Usually, when you restore a
database, you apply all log backups you made after the database backup. But if a database backup is
damaged, you can restore an earlier database backup and apply all the logs made after that backup.
For full recovery, you just need to start your restore with a full database backup, then apply an
unbroken chain of log backups to that database.
You need log backups to restore from file or filegroup backups. If you’re planning to restore
from individual files or filegroups, you need log backups from the time the file or filegroup backup
was made until the time you restore the backup. The log backups let SQL Server bring the restored
file or filegroup into sync with the rest of the database.
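A sketch of what that looks like, with hypothetical filegroup and path names:

RESTORE DATABASE Sales FILEGROUP = 'SalesArchiveFG'
    FROM DISK = 'E:\Backups\Sales_ArchiveFG.bak' WITH NORECOVERY
-- Apply every log backup made since the filegroup backup:
RESTORE LOG Sales FROM DISK = 'E:\Backups\Sales_log1.trn' WITH NORECOVERY
RESTORE LOG Sales FROM DISK = 'E:\Backups\Sales_log2.trn' WITH RECOVERY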
To restore to a specific point in time, you need a log backup made in full recovery model.
Restoring a log backup to a specific point in time requires that the log contain a sequential record of
all changes to the database. If you’re using the bulk-logged model and you’ve performed any
bulk-logged operations, the log won’t be a complete sequential record of the work, so you can’t do
point-in-time recovery.
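A point-in-time restore, with hypothetical names and timestamp, might look like this:

RESTORE DATABASE Sales FROM DISK = 'E:\Backups\Sales_full.bak' WITH NORECOVERY
RESTORE LOG Sales FROM DISK = 'E:\Backups\Sales_log1.trn'
    WITH STOPAT = '2005-06-15 14:30:00', RECOVERY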
You might need to mix log backups with differential backups. If certain data changes
repeatedly, a differential backup will capture only the last version of the data, whereas a log backup
will capture every change. Because SQL Server’s Database Maintenance Plan Wizard doesn’t give
options for differential backups, you need to define your own jobs for making differential backups.
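As a sketch of one possible rotation (the schedule, database name, and paths are hypothetical); on
restore, you apply the full backup, the most recent differential, and then only the log backups taken
after that differential:

-- Sunday: full backup
BACKUP DATABASE Sales TO DISK = 'E:\Backups\Sales_full.bak' WITH INIT
-- Weeknights: differential backup (changes since the last full backup)
BACKUP DATABASE Sales TO DISK = 'E:\Backups\Sales_diff.bak' WITH DIFFERENTIAL, INIT
-- Hourly: log backup
BACKUP LOG Sales TO DISK = 'E:\Backups\Sales_log_1400.trn' WITH INIT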
Practice recovering a database. Plan a recovery-test day to make sure your team knows exactly
what to do in case of a database failure. You may have the best backup strategy in the world, but if
you can’t use your backups, they’re worthless.
SQL Server backups are useless if you can’t recover them. Backups are simply big disk files unless
you have a recovery mechanism that puts those bits back into SQL Server when you need them. So
when was the last time you tested your restore strategy? I’m not asking whether you’ve performed a
backup and run the RESTORE command manually to see whether the media is valid. I’m asking
whether you’ve tested your restore methodology to make sure it works the way you think it does.
Can you get your production database back up and running after a disaster? You must have a
planned and tested restore methodology to be sure.
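Verifying the media is only the first step. A more meaningful test restores the backup onto a test
server under a different name and path (all names below are hypothetical) and then checks
consistency:

-- Quick media check (doesn't prove the data is restorable in practice):
RESTORE VERIFYONLY FROM DISK = 'E:\Backups\Sales_full.bak'

-- A fuller test: restore to a test database, then run consistency checks.
RESTORE DATABASE Sales_Test FROM DISK = 'E:\Backups\Sales_full.bak'
    WITH MOVE 'Sales_Data' TO 'D:\Test\Sales_Test.mdf',
         MOVE 'Sales_Log'  TO 'D:\Test\Sales_Test.ldf',
         RECOVERY
DBCC CHECKDB ('Sales_Test')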
At a SQL Server Magazine Connections conference in Orlando, Florida, I sat in on Kimberly Tripp's talk
about SQL Server backup and restore. Kimberly presented a number of interesting tips, but her
fundamental backup tenet is that the backup is useless without the ability to restore it. And you
don’t know that you can restore your backup unless you’ve fully tested your plan.
I suspect that many of you don’t have a well-tested backup and recovery plan. Testing backup
and recovery plans can be difficult, especially if you don’t have the hardware resources to do a
complete dry run of a failure and recovery. For example, properly testing your restore methodology
is hard to do if your production system is a one-tier warehouse and you don’t have a test server of
equal capacity. But budgeting for adequate testing and quality assurance equipment should be a
non-negotiable part of an efficient data center. If you haven’t planned how to recover your data and
tested that plan, when a true disaster happens, you’re asking for trouble.
If you haven’t tested your backup and recovery plan, your backups might not be as valuable as
you think they are. Backing up is easy; getting the data back can be the hard part.
Chapter 7:
Recently, a company that’s thinking about deploying a SAN for its Exchange servers contacted me.
The company wanted to know whether a SAN made sense for its organization as well as how best to
configure and tune a SAN. Read on to learn the answers to these questions.
Figuring out whether a SAN makes sense for a given organization can be tricky because the term
spans a wide range of technology and complexity. For example, you could claim that the old Dell
650F storage enclosure I owned several years ago was a SAN. It had a Fibre Channel interconnect,
and I used it as shared storage for a three-node cluster. It didn’t, however, have replication, dynamic
load balancing, or much expandability. Since that time, the SAN category has broadened so that it
includes two primary classes of devices.
Fibre Channel SANs use optical fiber (or, in rare cases, copper cables) to interconnect SAN
devices. Each node on the SAN requires a Fibre Channel host bus adapter (HBA), and most Fibre
Channel SANs use a fibre switch to provide mesh-like connectivity between nodes. Fibre Channel
speeds range from 1Gbps to 4Gbps and, with the right implementation, can span distances of up to
100 kilometers.
iSCSI SANs are a relatively new, lower-cost way to implement SANs. Instead of using optical fiber
or copper, an iSCSI SAN uses TCP/IP over ordinary network cabling. Its advantages are pretty obvious:
lower costs and more flexibility. Instead of spending big bucks on Fibre Channel HBAs and switches,
you can deploy lower-cost Gigabit Ethernet HBAs (which are, more or less, ordinary network adapter
cards) and switches, and it’s much easier to extend the distance between SAN devices without
resorting to a backhoe.
In either case, the primary advantages of SANs are their flexibility, performance capabilities, and
support for high availability and business continuance. Let’s consider each of these advantages
separately.
SAN gets its flexibility from the fact that it’s a big collection of physical disks that you can
assemble in various logical configurations. For example, if you have an enclosure with 21 disks, you
can make a single 18-disk RAID-5 array with three hot spares, a pair of 9-disk RAID-5 arrays with
three hot spares, or one whopping RAID-1+0 array (although I would be loath to give up those
spares). In theory, these configurations let you build the precise mix of logical volumes you need and
tailor the spindle count and RAID type of each volume for its intended application. (In practice,
this doesn't always happen, as I'll discuss later.)
SAN’s performance capabilities are the result of two primary factors: lots of physical disks and a
big cache. Which of these is the dominant factor? It depends on the mix of applications you use on
the SAN, how many disks you have, and how they’re arranged. When you look at a SAN’s raw
performance potential, remember that the SAN configuration will have a great effect on whether you
actually realize that degree of performance.
When it comes to high availability and business continuance, even if you use your SAN only as a
big RAID array, you’ll still get the benefit of being able to move data between hosts on the SAN.
SANs also make it much easier to take point-in-time copies, either using Microsoft Volume Shadow
Copy Service (VSS) or vendor-specific mechanisms. Add replication between SAN enclosures, and you
get improved redundancy and resiliency (albeit at a potentially high cost).
SANs are often deployed in conjunction with clusters, but they don’t have to be. A SAN shared
between multiple unclustered mailbox servers still offers the benefits I describe above—without the
complexity of clustering. SANs themselves are fairly complex beasts, which is one common (and
sensible) reason why organizations that could use SAN’s performance and flexibility sometimes shy
away from SAN deployments. If you aren’t comfortable setting up, provisioning, and managing a
SAN, being dependent on it can actually leave you worse off than you would have been without it.
Cost is also a factor to consider. Obviously, the actual cost of a given solution varies according to
its specifics, but all this capability doesn’t come cheap. Purchase and maintenance cost is the other
big reason why SANs aren’t more prevalent; many organizations find that they get more business
value from spending their infrastructure dollars in other ways.
Chapter 8:
Clustering Microsoft Exchange Server 2003 servers can potentially improve service levels by reducing
downtime—especially planned downtime, when you have to reboot servers after applying monthly
Microsoft patches. Windows Server 2003 includes enhancements that make setting up and deploying
a cluster much easier than under Windows 2000 Server. If you believe clustering can benefit your
Exchange organization and are ready to get started, this chapter can help guide you through the
cluster-setup process. I’ll explain Exchange clustering basics and the preparatory steps you must take
before building a new two-node Exchange 2003 Service Pack 1 (SP1) cluster running on Windows
2003 SP1. And I’ll explain how to install Exchange 2003 on a Windows 2003 SP1 cluster and
post-installation best practices.
Before you can install an Exchange Virtual Server (EVS), you must manually create the following
cluster resources by using the Cluster Administrator program:
• a TCP/IP address resource for the EVS
• a network name for the EVS
• disk resources that the EVS uses
Later I’ll outline the steps for creating an Exchange cluster group and the resources that the EVS
requires. When you configure an Exchange 2003 cluster, the Exchange Setup program places all these
resources in a resource group (called an Exchange cluster group). An EVS can’t be split across
separate resource groups, which ensures that the resources and virtual servers all fail over as a single
unit and that resource-group integrity is maintained.
Install layered products that Exchange requires. Exchange requires several Windows
components, which you install via the Add or Remove Programs applet. You need to install these
components on both cluster nodes. To install the components, in Add or Remove Programs click
Add/Remove Windows components, check Application Server, click Details, and check ASP.NET and
Internet Information Services (IIS). Click Details, then check Common Files, NNTP Service, SMTP
Service, and World Wide Web Service. Click OK to close each box and install the components.
Install Windows 2003 SP1 and the latest security patches. This is a good time to install the latest
patches on your cluster because no mailboxes or users are connected to it yet.
Verify the network connection configurations on each cluster node. Each node has two
network connections: a public connection to the LAN, which Outlook clients use to access the EVS
and administrators use to remotely manage the cluster, and a heartbeat connection, which the cluster
service uses to detect whether a cluster node is online or offline. (The public network can also be
used for heartbeat communication.) Occasionally, Windows clusters can be built with the cluster
heartbeat set at a higher priority in the binding order (i.e., the order in which they’re accessed by
network services) than the public-facing LAN.
Modify the binding order so that the public-facing connection is highest, followed by the
heartbeat, as Figure 1 shows. To check the binding order, in Control Panel, go to Network
Connections, Advanced, Advanced Settings, and select the Adapters and Bindings tab. Standardize the
binding order on each node. Follow the steps described at http://support.microsoft.com/?kbid=258750
to remove NetBIOS from the heartbeat connections, set the appropriate cluster communication
priority, and define the correct NIC speed and duplex mode on the heartbeat connection.
Figure 1
Modifying the binding order
Create cluster groups and the Microsoft Distributed Transaction Coordinator (MS DTC)
resource. Before installing Exchange, you need to create the required cluster groups and resources.
By default, the cluster installation program creates an initial configuration that includes a cluster
resource group containing the cluster IP address resource, a cluster network name, and a quorum
disk resource (the quorum drive is usually assigned drive letter Q). This group is commonly known
as the cluster group. The cluster installation program also creates a cluster resource group for each
disk resource (called Group 0, Group 1, and so on). Figure 2 shows my initial cluster configuration
for DARA-CL1.
Figure 2
An initial cluster configuration
Exchange 2003 requires MS DTC to be configured as a cluster resource. MS DTC is a service based
on the OLE transactions-interface protocol, which provides an object-oriented (OO) interface for initi-
ating and controlling transactions. The Microsoft article at http://www.support.microsoft.com/?kbid=301600
describes the procedure for configuring MS DTC in a cluster environment and recommends placing
MS DTC in a separate cluster resource group with its own disk, IP address, and network name
resources. In my opinion, the advice in this article is appropriate for applications such as Microsoft
SQL Server that make heavy use of MS DTC. Exchange, on the other hand, makes very light use of
MS DTC and unless you’ve deployed some workflow applications that use MS DTC, you can actually
take MS DTC offline without affecting Exchange. (See also some sound advice about this matter
posted by Evan Dodds at http://blogs.technet.com/exchange/archive/category/3896.aspx.) Therefore,
I recommend placing the MS DTC resource in the cluster group (along with the cluster network
name, IP address, and quorum disk resources).
You should also create each EVS in a separate cluster resource group (i.e., an Exchange cluster
group). To create a group, in Cluster Administrator click New, then click Resource Group. For my
EVS, DARA-EVS1, I called my resource group DARA-EVS1-GRP to reflect the EVS’s name. The New
Resource wizard then prompts you for the possible owners (nodes) of this resource group. For a
two-node cluster, add the two nodes as possible owners, as Figure 3 shows.
Next, move disk resources created by the cluster installation program to the new Exchange
resource group. To move a resource, right-click the disk resource and select Move Group. Choose the
Exchange resource group (here, DARA-EVS1-GRP). Repeat this procedure for each resource group
created by the cluster installation program.
Figure 3
Adding nodes as possible owners
You can now delete the disk resource groups created by the cluster installation program (e.g.,
Group 0, Group 1, Group 2) because the previous step left them empty. To delete a resource group,
in Cluster Administrator, right-click the resource group and click Delete. You should now have a
cluster group configuration (two
resource groups: a cluster group and a resource group for Exchange) similar to the one that Figure 4
shows.
Figure 4
Cluster group configuration
At this point, you need to create the resources in the Exchange cluster group that each EVS
requires, such as an IP address, a network name, and disk resources. Create the IP address resource
in your Exchange cluster group. To create the IP address resource for the EVS, in Cluster Adminis-
trator, click File, New Resource, which displays the dialog box that Figure 5 shows. Select IP Address
for the resource type and make sure you select the Exchange resource group (not the cluster group)
as the group for the new IP address resource.
Figure 5
New Resource dialog box
After you click Next, the wizard prompts you to select nodes that can own this resource and to
accept that both nodes can be owners. Next, the wizard asks you to select resource dependencies for
the IP address resource. You don’t need to set any resource dependencies, so click Next. The wizard
then asks you to supply an IP address and subnet mask and choose a network for the IP address.
Make sure you choose Public Network because clients will use this network connection to connect to
the EVS. The heartbeat network is for internal cluster communications only. Click Next, and you
should see the message Cluster Resource created successfully. Bring the IP address resource online by
right-clicking it and clicking Bring Online.
The procedure for creating the EVS network name resource is similar to that for creating the EVS
IP address resource. As before, in Cluster Administrator, click File, New Resource, and select Network
Name from the resource list. Make sure you select the Exchange resource group as the group. Select
the Network Name resource, enter its name, and click Next. The wizard prompts you to accept that
both nodes should be owners of this resource. Click Next. Next, you’re asked to specify dependen-
cies for the Network Name resource. This resource has a dependency on the IP address resource
because the IP address resource must be online before the Network Name can come online. Choose
the IP address resource you created in the previous step and click Next. The wizard now prompts
you to specify the EVS name. At this point, you can also specify whether the EVS name should be
authenticated against AD by using Kerberos authentication and whether it should be registered with a
DNS server. Enable these settings, as Figure 6 shows, and click Next. You should see a message
stating that the Network Name resource was created successfully. At creation time, the Network Name
will be offline. Bring the resource online by right-clicking it and clicking Bring Online.
Figure 6
Network Name Parameters dialog box
Microsoft provides a tool that can help you verify and document your cluster configuration, called
the Cluster Diagnostics and Verification Tool (ClusDiag.exe). You can download the tool at
http://www.microsoft.com/downloads/details.aspx?familyid=b898f587-88c3-4602-
84de-b9bc63f02825&displaylang=en.
Document the installation. Document every step of the installation. The documentation shows
other administrators how you built the server and can also be useful for disaster recovery purposes
(i.e., if you have to rebuild the cluster). The easiest way to create documentation is to open a new
WordPad or Microsoft Word document, then for every step of the installation take a screen shot of
the active window by pressing Alt+Print Screen, which copies an image of the window into the
Clipboard. From WordPad, select Edit, Paste Special, and choose Device Independent Bitmap.
Choosing to paste the image as a device-independent bitmap (instead of doing a standard paste)
reduces the document’s size. Store the documentation log somewhere safe (and off the cluster).
Install Exchange 2003 on the first cluster node. Log on to Node 1 by using an account that has
Exchange Full Administrator permissions. Run the Exchange 2003 setup program (located in the
\setup\i386 folder of the Exchange 2003 CD-ROM). You’ll get the error message Exchange Server
2003 has a known compatibility issue with this version of Windows. Ignore this message for now; later
we’ll apply Exchange 2003 SP2 to correct the problem. At the Welcome screen, click Next.
Click I Agree to accept the End User License Agreement (EULA). Under the Action menu, select
Typical. At the next screen, accept the terms of a Per Seat Licensing Agreement. The installation
begins; a completion message is displayed when the installation is done.
Install Exchange 2003 on the second cluster node. Install Exchange 2003 on the second node
(Node 2), repeating the steps you followed for Node 1. You can’t update the binaries to Exchange
2003 SP2 yet; you must create the EVS before you can apply SP2.
Create the EVS. After you've installed the Exchange binaries on the cluster nodes, you can create an
EVS. You must create the EVS on the active node (the node that currently owns the Exchange
resource group and its disk resources) because the setup program places Exchange databases,
transaction logs, and other components on the shared storage so that each node can access them.
To create the EVS, log on to the active node by using an account that has Exchange Full
Administrator privileges. Open Cluster Administrator; click File, New, Resource; and select Microsoft
Exchange System Attendant. Create this resource in the Exchange cluster group (not the same as the
cluster group), which in our example is DARA-EVS1-GRP, as Figure 7 shows. Click Next.
Figure 7
Creating the EVS
You’re prompted to specify nodes as possible owners for this resource. Both cluster nodes
should be listed as owners by default. Click Next to continue. Next, you’re prompted to supply
dependencies for the Exchange System Attendant resource. Select all the resources (IP address,
network name, and disk resources) in the left pane and click Add to add them as required resources
for the Exchange System Attendant resource. Click Next.
Now you’ll choose an administrative group for your EVS. Select a group and click Next, then
select a routing group for your EVS and click Next. At the next prompt, select a data directory folder
in the shared storage to contain Exchange databases, transaction logs, the SMTP folders, the
full-text–indexing database, and the message-tracking log files. The default location for this folder is
the \exchsrvr folder on a physical disk resource in the Exchange resource group that you created
earlier. A limitation of the Exchange setup program is that it places all the components (such as the
databases) in the same folder at installation time; you need to move them manually after the installa-
tion. To help me identify which Exchange components the setup program has placed at installation
time, I usually name the folder by using the convention \exchsrvr_staging-EVSname. (In this cluster
build, I called it \exchsrvr_staging_DARA-EVS1 to indicate that this folder was created at installation
time for that EVS.) Later I'll explain how you can move these components to other drives or folders.
The setup program displays a summary screen that lists all the settings, as Figure 8 shows. Click
Finish to accept the installation summary. The Exchange setup program automatically creates
resources for Exchange components such as the Information Store and protocol resources for IMAP
and POP. Upon completion, the setup program displays a message stating that the Exchange System
Attendant cluster resource was created successfully.
Figure 8
A settings summary screen
At this point, you can see in Cluster Administrator that all Exchange resources are offline. Bring
each resource online by right-clicking the Exchange resource and selecting Bring Online. The list of
resources displayed in Cluster Administrator should look similar to the screen that Figure 9 shows.
Figure 9
A list of resources
To verify that the EVS is online, open Exchange System Manager (ESM) and check your adminis-
trative group. DARA-EVS1 appears as a clustered server running Exchange 2003 RTM version 6944.4.
I recommend you now perform a failover test to Node 2 to verify that the EVS can run on both
nodes. To perform a failover by using Cluster Administrator, right-click the Exchange resource group
associated with the EVS (here, DARA-EVS1) and select Move Group. Doing so triggers a failover of
your newly created EVS from Node 1 to Node 2.
With the EVS now running on Node 2, install Exchange 2003 SP2 on Node 1 by running update.exe
from the SP2 distribution and following the prompts, as Figure 10 shows.
Figure 10
Installing Exchange 2003 SP2
Next you need to move the Exchange resource group from Node 2 to Node 1 by performing a
failover (right-click the Exchange resource group and select Move Group). Be aware that you can’t
perform this process from a Cluster Administrator session running on Node 2 because the files
required for the upgrade procedure aren’t yet installed on Node 2. The requirement to run an
additional upgrade procedure for Exchange service packs from Cluster Administrator is new in
Exchange 2003. It was first introduced as part of the cluster upgrade procedure from Exchange 2000
Server to Exchange 2003. The procedure is also required to upgrade a cluster from Exchange 2003
RTM to Exchange 2003 SP1.
The Exchange program files are now upgraded to Exchange 2003 SP2 on Node 1. To finish the
upgrade, log on to Node 1. Right-click the System Attendant resource for the EVS and select Upgrade
Exchange Virtual Server. When the upgrade is done, you should see the message The Exchange
Virtual Server has been upgraded successfully.
Node 2 is running the Exchange 2003 RTM version. Install Exchange 2003 SP2 on Node 2 by
running update.exe. At the Licensing Agreement screen, select I Agree to accept the License
Agreement and click Next. Select Update from the Action column. When the SP2 upgrade is finished
on Node 2, reboot if prompted to do so, and when Node 2 has finished restarting, verify that SP2 has
been installed correctly by moving the Exchange resource group from Node 1 to Node 2 as you did
earlier. As a final test, I recommend you reboot each node in turn, starting with Node 1. The EVS
should fail over from Node 1 to Node 2. When Node 1 is back online and has rejoined the cluster,
reboot Node 2 to test the failover from Node 2 to Node 1. These tests will verify that the Exchange
cluster is configured correctly. After each failover finishes, check the Application log for errors.
Redistribute Exchange components across disk resources. The Exchange cluster setup
program places all the Exchange components in the data directory folder that you specified during
EVS creation. (I placed all the components in the E:\exchsrvr_staging_DARA-EVS1 folder when I
created the sample EVS.) The following folders contain the Exchange cluster components for our
sample installation:
• E:\exchsrvr_staging_DARA-EVS1\mdbdata contains Exchange .edb files, streaming database (.stm)
files, the checkpoint file, and the transaction logs.
• E:\exchsrvr_staging_DARA-EVS1\mtadata contains the Message Transfer Agent (MTA) folder.
• E:\exchsrvr_staging_DARA-EVS1\mailroot contains the folder structures that the SMTP Virtual
Server uses.
• E:\exchsrvr_staging_DARA-EVS1\exchangeserver_servername contains the full-text–indexing
database associated with the EVS.
• E:\exchsrvr_staging_DARA-EVS1\servername.log contains the message-tracking log files.
Place transaction logs and Exchange databases on separate drives. Placing these entities on
separate drives is a long-established Microsoft best practice, which I strongly advise you to adhere to.
Doing so will help your Exchange server perform better. (The Information Store process writes each
transaction to the database and transaction logs; splitting them across different physical drives
distributes the load on the storage.) More important, though, placing the transaction logs on a drive
physically separate from the database lets you recover data from a backup if you lose the drive that
holds your databases. For more information about placing databases and transaction logs on separate
drives, follow the instructions in the Microsoft article “How to move Exchange databases and logs in
Exchange Server 2003” (http://support.microsoft.com/?kbid=821915).
Move SMTP folders. To move SMTP folders, follow the instructions in the Microsoft article “How to
change the Exchange 2003 SMTP Mailroot folder location” (http://support.microsoft.com/?kbid=822933).
Move the indexing files. To move the full-text–indexing property store and property store logs,
follow the instructions at http://www.microsoft.com/technet/prodtechnol/exchange/guides/
workinge2k3store/e1ea3634-a2c0-40e6-ad50-e9e988ae4728.mspx and in the Microsoft article “XADM:
Recommendations for Using Content Indexing Utilities (Pstoreutl or Catutil) in a Cluster Environment”
(http://support.microsoft.com/?kbid=294821). You use two utilities to move the indexing files:
pstoreutl.exe, which moves the property store to another drive location, and catutil.exe, which moves
the catalog (index) to another drive location.
Back up the cluster. After you’ve moved the necessary components, perform a full backup of your
cluster by using NTBackup. Back up the drives in your local storage and the system state and also
perform a full Exchange database backup.
Install third-party products. Install third-party layered products such as file-based antivirus
software, Exchange-aware antivirus software, and monitoring software. Take care to exclude folders
that contain Exchange databases and transaction logs from a file-based virus scanner because such
scanners can corrupt Exchange databases. (For more information, see the Microsoft article “Overview
of Exchange Server 2003 and antivirus software” at http://support.microsoft.com/?kbid=823166.)
Perform tests. Create a test mailbox on your cluster and create a Microsoft Outlook profile that has
Cached Exchange Mode enabled. Perform some failover tests and record the time it took for the EVS
to go offline and online between cluster nodes. You’ll find this information useful for planning future
maintenance. As you add mailboxes to the cluster, failover times might increase because more
connections will be open to the EVS. On production clusters that have several hundred active client
connections, I’ve seen failovers take 2 to 10 minutes. Failover times depend on many different factors,
such as the hardware specification of cluster nodes, performance of the Exchange storage subsystem,
and number of active connections.
Run Exchange Server Best Practices Analyzer (ExBPA). Run ExBPA against your newly
installed cluster to verify your installation. ExBPA is cluster aware and will analyze the configuration
of the cluster and EVS and generate a report like the one that Figure 11 shows. You can download
ExBPA at http://www.microsoft.com/downloads/details.aspx?familyid=dbab201f-4bee-4943-ac22-
e2ddbd258df3&displaylang=en.
Figure 11
Sample ExBPA report
Setting Permissions
Microsoft built many security enhancements into Exchange 2003, and a number of these improvements are cluster-
specific. In Exchange 2000 Server, the cluster service account required Exchange Full Administrator rights at the
Administrative Group level to create an EVS. With Exchange 2003, the cluster service account doesn’t require any
Exchange-specific permissions delegated to the Administrative Group. To install an EVS, the account you use to
perform the installation must be delegated Exchange Full Administrator rights at the Administrative Group level (or be
a member of a security group that’s been delegated Exchange Full Administrator rights). If the EVS is the first EVS to
be installed in your Exchange organization, the account you use to perform the installation also requires Exchange
Full Administrator rights at the Organization level. (For more information and best practices for setting Exchange
permissions, see the document Working with Active Directory Permissions in Microsoft Exchange 2003 at
http://www.microsoft.com/downloads/details.aspx?familyid=0954b157-5add-48b8-9657-
b95ac5bfe0a2&displaylang=en.)