Dag 2013


Agenda

Storage
High Availability
Site Resilience

Storage

Storage Challenges
Disks
Capacity is increasing, but IOPS are not

Databases
Database sizes must be manageable

Database Copies
Reseeds must be fast and reliable
Passive database copy IOPS are inefficient
Lagged copies have asymmetric storage requirements

Lagged copies require manual care

Storage Innovations

Multiple Databases Per Volume


Autoreseed
Self-Recovery from Storage Failures
Lagged Copy Innovations

Multiple databases per volume

Multiple databases per volume
4-member DAG
4 databases
4 copies of each database
4 databases per volume
Symmetrical design
[Diagram: copies of DB1 through DB4 laid out across the volumes of the four DAG members; legend: Active, Passive, Lagged]

Multiple databases per volume
Single database copy/disk:
Reseed 2TB Database = ~23 hrs
Reseed 8TB Database = ~93 hrs

[Diagram: a single DB1 copy per disk, with a 20 MB/s seeding rate labeled; legend: Active, Passive, Lagged]

Multiple databases per volume
Single database copy/disk:
Reseed 2TB Database = ~23 hrs
Reseed 8TB Database = ~93 hrs
4 database copies/disk:
Reseed 2TB Disk = ~9.7 hrs
Reseed 8TB Disk = ~39 hrs
[Diagram: four database copies per disk, with seeding rates of 12 MB/s, 20 MB/s, 20 MB/s, and 12 MB/s labeled on individual copies; legend: Active, Passive, Lagged]
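The reseed figures on this slide can be sanity-checked with simple arithmetic. The sketch below is only a back-of-the-envelope check: the slide does not state how much data each reseed actually copies or what aggregate rate the four-copy layout reaches, so both are back-calculated here from the quoted hours and the 20 MB/s rate shown in the diagram. The function names are illustrative.

```python
def reseed_hours(data_mb, rate_mb_per_s):
    """Hours needed to copy data_mb megabytes at a sustained rate."""
    return data_mb / rate_mb_per_s / 3600


def implied_data_mb(hours, rate_mb_per_s):
    """Data volume implied by a quoted reseed time at a given rate."""
    return hours * 3600 * rate_mb_per_s


def implied_rate_mb_per_s(data_mb, hours):
    """Aggregate rate implied by copying data_mb in the quoted time."""
    return data_mb / (hours * 3600)


# Single copy per disk: ~23 hrs at 20 MB/s implies roughly 1.66 million MB,
# i.e. the slide appears to assume a disk that is not completely full.
data_2tb_disk = implied_data_mb(23, 20)

# Four copies per disk: the same data in ~9.7 hrs implies an aggregate rate
# well above the 20 MB/s of a single seeding stream.
aggregate_rate = implied_rate_mb_per_s(data_2tb_disk, 9.7)

print(f"implied data on 2TB disk: {data_2tb_disk:,.0f} MB")
print(f"implied aggregate rate:   {aggregate_rate:.1f} MB/s")
```

The point of the comparison is that several smaller databases can be seeded concurrently, so the disk is rebuilt faster than a single large database streaming from one source.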

Multiple databases per volume
Requirements
Single logical disk/partition per physical disk

Recommendations
Databases per volume should equal the number of copies per database
Same neighbors on all servers

Autoreseed

Seeding Challenges
Disk failure on active copy = database failover
Failed disk and database corruption issues need to be addressed quickly
Fast recovery to restore redundancy is needed

Seeding Innovations

Automatically restore redundancy after disk failure using provisioned spares


[Diagram: a failed disk (X) in In-Use Storage triggers a disk re-seed operation onto a disk from Spares]

Autoreseed Workflow

Autoreseed Workflow

1. Detect a copy in a Failed and Suspended (F&S) state for 15 minutes in a row
2. Try to resume the copy 3 times (with 5-minute sleeps in between)
3. Try assigning a spare volume 5 times (with 1-hour sleeps in between)
4. Try InPlaceSeed with SafeDeleteExistingFiles 5 times (with 1-hour sleeps in between)
5. Once all retries are exhausted, the workflow stops
6. If 3 days have elapsed and the copy is still F&S, the workflow state is reset and starts from Step 1
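The retry counts and sleeps in the workflow can be laid out as a timeline. The following is a minimal sketch, not Exchange's implementation: it assumes every attempt fails instantly and that each phase starts immediately after the previous one ends, so the offsets are a lower bound. `PHASES` and `retry_schedule` are illustrative names.

```python
# (action, number of attempts, sleep between attempts in minutes),
# transcribed from steps 2-4 of the workflow.
PHASES = [
    ("resume copy", 3, 5),
    ("assign spare volume", 5, 60),
    ("InPlaceSeed with SafeDeleteExistingFiles", 5, 60),
]
DETECTION_MINUTES = 15             # step 1: F&S for 15 minutes in a row
RESET_AFTER_MINUTES = 3 * 24 * 60  # step 6: workflow state resets after 3 days


def retry_schedule():
    """Yield (minutes since the copy went F&S, action, attempt number)."""
    clock = DETECTION_MINUTES
    for action, attempts, sleep in PHASES:
        for attempt in range(1, attempts + 1):
            yield clock, action, attempt
            if attempt < attempts:  # sleeps happen only between attempts
                clock += sleep


schedule = list(retry_schedule())
for minutes, action, attempt in schedule:
    print(f"t+{minutes:>3} min  {action} (attempt {attempt})")
```

Under these assumptions the last in-place seed attempt happens a little over eight hours after the failure, long before the 3-day reset in step 6.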

Autoreseed Workflow
Prerequisites
Copy is not ReseedBlocked or ResumeBlocked
Logs and database file(s) are on same volume
Database and log folder structure matches required naming convention
No active copies on failed volume
All copies are F&S on the failed volume
No more than 8 F&S copies on the server (more than that might indicate a controller failure)
For InPlaceSeed
Up to 10 concurrent seeds are allowed
If a database file exists, wait for 2 days before in-place reseeding
Waiting period is based on the LastWriteTime of the database file
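The prerequisites above amount to a set of boolean checks. The sketch below expresses them as two predicates, purely for illustration: the `VolumeState` fields and function names are hypothetical and do not correspond to any Exchange API, and the checks simply mirror the bullets on the slide.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

MAX_FS_COPIES_PER_SERVER = 8         # more may indicate a controller failure
IN_PLACE_WAIT = timedelta(days=2)    # wait before overwriting an existing file


@dataclass
class VolumeState:
    """Hypothetical snapshot of one failed volume and its server."""
    reseed_blocked: bool
    resume_blocked: bool
    logs_and_db_on_same_volume: bool
    folder_layout_matches_convention: bool
    copies_on_volume: int
    active_copies_on_volume: int
    fs_copies_on_volume: int
    fs_copies_on_server: int


def autoreseed_allowed(v: VolumeState) -> bool:
    """True when every prerequisite listed on the slide holds."""
    return (
        not v.reseed_blocked
        and not v.resume_blocked
        and v.logs_and_db_on_same_volume
        and v.folder_layout_matches_convention
        and v.active_copies_on_volume == 0
        and v.fs_copies_on_volume == v.copies_on_volume
        and v.fs_copies_on_server <= MAX_FS_COPIES_PER_SERVER
    )


def in_place_seed_allowed(db_last_write: Optional[datetime], now: datetime) -> bool:
    """In-place reseed waits 2 days after the database file's LastWriteTime."""
    if db_last_write is None:  # no database file exists, nothing to protect
        return True
    return now - db_last_write >= IN_PLACE_WAIT
```

The 2-day wait gives an administrator a window to salvage an existing database file before the in-place seed deletes it.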

Autoreseed

[Diagram: AutoDagDatabasesRootFolderPath (ExchDbs) containing MDB1 and MDB2; AutoDagVolumesRootFolderPath (ExchVols) containing Vol1, Vol2, and Vol3; AutoDagDatabaseCopiesPerVolume = 1; the MDB1 folder contains MDB1.DB and MDB1.log]
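The folder layout in the diagram can be written down as a small path-building helper. This is a sketch under assumptions: the `C:` drive and the `expected_layout` name are illustrative, the root folder name is taken from the diagram, and the `<name>.DB` and `<name>.log` folder names follow the labels shown there.

```python
from pathlib import PureWindowsPath


def expected_layout(db_name, databases_root=r"C:\ExchDbs"):
    """Return the folders Autoreseed would expect for one database.

    databases_root stands in for AutoDagDatabasesRootFolderPath; the drive
    letter is an assumption, not something the diagram shows.
    """
    mount_point = PureWindowsPath(databases_root) / db_name
    return {
        "mount_point": mount_point,
        "database_folder": mount_point / f"{db_name}.DB",
        "log_folder": mount_point / f"{db_name}.log",
    }


for role, path in expected_layout("MDB1").items():
    print(f"{role:16} {path}")
```

Keeping the database and log folders under one mount point is what satisfies the prerequisite that logs and database files share a volume.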

Autoreseed
Requirements
Single logical disk/partition per physical disk
Specific database and log folder structure must be used
Recommendations
Same neighbors on all servers
Databases per volume should equal the number of copies per database

Autoreseed
Numerous fixes in CU1
Autoreseed not detecting spare disks correctly
Autoreseed not using spare disks
Increased Autoreseed copy limits (previously 4, now 8)
Better tracking around mount path and ExchangeVolume path
Get-MailboxDatabaseCopyStatus displays ExchangeVolumeMountPoint

Shows the mount point of the database volume under C:\ExchangeVolumes

Update-MailboxDatabaseCopy includes new parameters designed to aid with automation

Parameter: Description
BeginSeed: Useful for scripting reseeds. Task asynchronously starts the seeding operation and then exits the cmdlet.
MaximumSeedsInParallel: Used with the Server parameter to specify the maximum number of parallel seeding operations across the specified server during a full server reseed operation. Default is 10.
SafeDeleteExistingFiles: Used to perform a seeding operation with a single copy redundancy pre-check prior to the seed. Because this parameter includes the redundancy safety check, it requires a lower level of permissions than DeleteExistingFiles, enabling a limited-permission administrator to perform the seeding operation.
Server: Used as part of a full server reseed operation to reseed all database copies in an F&S state. Can be used with MaximumSeedsInParallel to start reseeds of database copies in parallel across the specified server, in batches of up to MaximumSeedsInParallel copies at a time.
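The batching behavior described for MaximumSeedsInParallel can be illustrated with a simple chunking function. This is only a model of the description above, not the cmdlet's implementation; in particular, whether Exchange waits for a whole batch to finish before starting the next one is not stated. `seed_batches` is an illustrative name.

```python
def seed_batches(copies, maximum_seeds_in_parallel=10):
    """Split the F&S copies on a server into batches of at most N seeds."""
    if maximum_seeds_in_parallel < 1:
        raise ValueError("maximum_seeds_in_parallel must be at least 1")
    for start in range(0, len(copies), maximum_seeds_in_parallel):
        yield copies[start:start + maximum_seeds_in_parallel]


# 23 hypothetical failed copies with the default limit of 10.
copies = [f"DB{n}" for n in range(1, 24)]
batches = list(seed_batches(copies))
print([len(batch) for batch in batches])
```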

Self-recovery from storage failures

Recovery Challenges

Storage controllers are basically mini-PCs
As such, they can crash, hang, etc., requiring administrative intervention
Other operator-recoverable conditions can occur
Loss of vital system elements
Hung or highly latent IO

Lagged copy innovations

Lagged Copy Challenges


Activation is difficult
Lagged copies require manual care
Lagged copies cannot be page patched

Lagged Copy Innovations


Automatic log file replay
Low disk space (enable in registry)
Page patching (enabled by default)
Fewer than 3 other healthy copies (enable in Active Directory; configure in registry)
Integration with Safety Net
No need for log surgery or hunting for the point of corruption
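The three triggers for automatic log file replay can be summarized as a decision function. This sketch only encodes the defaults stated on the slide (page patching on, the other two off until enabled); the class, field, and function names are hypothetical, and how Exchange evaluates these conditions internally is not described in the source.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PlayDownSettings:
    """Which automatic replay triggers are switched on (illustrative)."""
    low_disk_space_check: bool = False  # slide: enable in registry
    page_patching: bool = True          # slide: enabled by default
    healthy_copy_check: bool = False    # slide: enable in Active Directory
    min_other_healthy_copies: int = 3


def play_down_reasons(low_disk_space: bool,
                      page_patch_needed: bool,
                      other_healthy_copies: int,
                      settings: Optional[PlayDownSettings] = None) -> List[str]:
    """Return the reasons a lagged copy would replay logs (empty if none)."""
    s = settings or PlayDownSettings()
    reasons = []
    if s.low_disk_space_check and low_disk_space:
        reasons.append("low disk space")
    if s.page_patching and page_patch_needed:
        reasons.append("page patching")
    if s.healthy_copy_check and other_healthy_copies < s.min_other_healthy_copies:
        reasons.append(f"fewer than {s.min_other_healthy_copies} other healthy copies")
    return reasons
```

With the default settings only a required page patch triggers replay; low disk space and a shortage of healthy copies are ignored until an administrator enables those checks.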

High Availability

High Availability Challenges


High availability focuses on database health
Best copy selection insufficient for new architecture
DAG network configuration still manual

High Availability Innovations


Managed Availability
Best Copy and Server Selection
DAG Network Autoconfig

Managed Availability

Managed Availability
Key tenets for Exchange 2013
Access to a mailbox is provided by the protocol stack on the Mailbox server that hosts the active copy of the mailbox
If a protocol is down on a Mailbox server, all access to active databases on that server via that protocol is lost
Managed Availability was introduced to detect and automatically recover from these kinds of failures
For most protocols, quick recovery is achieved via a restart action
If the restart action fails, a failover can be triggered

Managed Availability
An internal framework used by component teams
Sequencing mechanism to control when recovery actions are taken versus alerting and escalation
Enhances the Best Copy Selection algorithm by taking into account the overall server health of source and target

Managed Availability

MA failovers are a recovery action in response to a failure
Detected via a synthetic operation or live data
Throttled in time and across the DAG
MA failovers can happen at database or server level
Database: Store-detected database failure can trigger database failover
Server: Protocol failure can trigger server failover
Single Copy Alert integrated into MA
ServerOneCopyInternalMonitorProbe (part of DataProtection Health Set)
Alert is per-server to reduce alert volume
Still triggered across all machines with copies
Logs 4138 (red) and 4139 (green) events

Best Copy and Server Selection

Best Copy Selection


Challenges

Exchange 2010 used several criteria
Copy queue length
Replay queue length
Database copy status, including activation blocked
Content index status

Using just these criteria is not sufficient for Exchange 2013, because protocol health is not considered

Best Copy and Server Selection
Still an Active Manager algorithm, performed at failover or switchover (*over) time based on the extracted health of the system
Replication health is still determined by the same criteria and phases
Criteria now include the health of the entire protocol stack
Considers a prioritized protocol health set in the selection, using four priorities: critical, high, medium, and low
Failover responders trigger added checks to select a "protocol not worse" target
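One way to picture a "protocol not worse" check is to compare the unhealthy health sets of source and target at each priority level. The sketch below is one plausible reading, not the documented algorithm: the slide does not define the comparison, so the rule used here (the target may not have any unhealthy health set that is healthy on the source) is an assumption, and the health set names in the example are placeholders.

```python
PRIORITIES = ("critical", "high", "medium", "low")


def not_worse_than_source(source_unhealthy, target_unhealthy):
    """Compare two servers' unhealthy health sets, priority by priority.

    Each argument maps a priority name to the set of health sets that are
    currently unhealthy on that server. Assumed rule: at every priority, the
    target's unhealthy sets must be a subset of the source's.
    """
    for priority in PRIORITIES:
        source = source_unhealthy.get(priority, set())
        target = target_unhealthy.get(priority, set())
        if not target <= source:
            return False
    return True


# The source lost a critical protocol; the target is fully healthy.
print(not_worse_than_source({"critical": {"ProtocolA"}}, {}))
# The target has a problem the source does not have.
print(not_worse_than_source({}, {"high": {"ProtocolB"}}))
```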

Best Copy and Server Selection
Managed Availability imposes 4 new constraints on the Best Copy Selection algorithm

BCSS Changes in CU1


PAM tracks the number of active databases per server
Honors MaximumActiveDatabases, if configured
Allows Active Manager to exclude servers that are already hosting the maximum number of active databases when determining potential candidates for activation
Keeps an in-memory state that tracks the number of active databases per server
When the PAM role moves or when the Exchange Replication service is restarted on the PAM, this information is rebuilt from the cluster database
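The candidate filtering described above can be sketched as a lookup against per-server counts. This is an illustration only: the function name and the dictionaries standing in for the PAM's in-memory state are hypothetical, and a limit of `None` models a server where MaximumActiveDatabases is not configured.

```python
from typing import Dict, List, Optional


def activation_candidates(active_counts: Dict[str, int],
                          max_active: Dict[str, Optional[int]]) -> List[str]:
    """Return servers that can still accept another active database."""
    candidates = []
    for server, count in active_counts.items():
        limit = max_active.get(server)  # None means no limit configured
        if limit is None or count < limit:
            candidates.append(server)
    return candidates


# Hypothetical state: mbx2 is already at its configured maximum.
counts = {"mbx1": 3, "mbx2": 5, "mbx3": 7}
limits = {"mbx1": 5, "mbx2": 5, "mbx3": None}
print(activation_candidates(counts, limits))
```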

DAG Network Innovations

DAG Network Challenges

DAG networks must be manually collapsed in a multi-subnet deployment
Small remaining administrative burden for deployment and initial configuration

DAG Network Innovations


Automatically collapsed in multi-subnet environment
Automatic or manual configuration
Default is Automatic
Requires specific settings on MAPI and Replication network interfaces
Manual edits and EAC controls blocked by default
Set DAG to manual network setup to edit or change DAG networks

Site Resilience

Site Resilience Challenges


Operationally complex
Mailbox and Client Access recovery connected
Namespace is a SPOF

Site Resilience Innovations


Key Characteristics
DNS resolves to multiple IP addresses
Almost all protocol access in Exchange 2013 is HTTP
HTTP clients have built-in IP failover capabilities
Clients skip past IPs that produce hard TCP failures
Admins can switchover by removing VIP from DNS
Namespace no longer a SPOF
No dealing with DNS latency
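The client-side behavior described above (try each IP address the namespace resolves to and skip the ones that fail at the TCP level) can be sketched as follows. This is a simplified model of what an HTTP client stack does, not code from any particular client; the `connect` parameter exists only so the logic can be exercised without a network.

```python
import socket


def connect_first_reachable(addresses, port, connect=socket.create_connection,
                            timeout=3.0):
    """Try each resolved IP in order; return (ip, connection) for the first
    one that accepts a TCP connection, or None if all of them fail."""
    for ip in addresses:
        try:
            return ip, connect((ip, port), timeout)
        except OSError:
            # Hard TCP failure (refused, unreachable, timed out): skip this IP.
            continue
    return None


# Usage with a stand-in connector that simulates one failed VIP.
def fake_connect(address, timeout):
    if address[0] == "192.168.1.50":
        raise ConnectionRefusedError("VIP is down")
    return f"connection to {address[0]}"


print(connect_first_reachable(["192.168.1.50", "10.0.1.50"], 443,
                              connect=fake_connect))
```

Because the retry happens inside the client, no DNS change has to propagate before users reach the surviving VIP.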

Site Resilience Innovations


Operationally simplified
Mailbox and Client Access recovery independent
Namespace provides redundancy

Site Resilience
Operationally Simplified
Previously, the loss of a CAS, CAS array, VIP, LB, etc., required the admin to perform a datacenter switchover
In Exchange Server 2013, recovery happens automatically

The admin focuses on fixing the issue, instead of restoring service

Site Resilience
Mailbox and CAS recovery independent
Previously, CAS and Mailbox server recovery were tied together in site recoveries
In Exchange Server 2013, recovery is independent, and may come automatically in the form of failover

This is dependent on business requirements and configuration

Site Resilience
Namespace provides redundancy
Previously, the namespace was a single point of failure
In Exchange 2013, the namespace provides redundancy by leveraging multiple A records and the ability of the client OS/HTTP stack to fail over

Site Resilience
Support for new deployment scenarios
With the namespace simplification, consolidation of server roles, separation of CAS array and DAG recovery, de-coupling of CAS and Mailbox by AD site, and load balancing changes, three locations, if available, can simplify mailbox recovery in response to datacenter-level events
You must have at least three locations
Two locations with Exchange; one with witness server
Exchange sites must be well-connected
Witness server site must be isolated from network failures affecting Exchange sites

Site Resilience Failover Examples

Site Resilience Failover Examples
With multiple VIP endpoints sharing the same namespace, if one VIP fails, clients automatically fail over to alternate VIP(s)
Removing the failing IP from DNS puts you in control of the in-service time of the VIP

[Diagram: mail.contoso.com resolves to 192.168.1.50 and 10.0.1.50, then to 10.0.1.50 only; cas1 and cas2 in the primary datacenter (Redmond), cas3 and cas4 in the alternate datacenter (Portland)]

Site Resilience Failover Examples
Assuming MBX3 and MBX4 are operating and one of them can lock the witness.log file, automatic failover of active databases should occur

[Diagram: dag1 with mbx1 and mbx2 in the primary datacenter (Redmond, marked failed with X), mbx3 and mbx4 in the alternate datacenter (Portland), and the witness in a third datacenter (Paris)]
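The reason the surviving pair can keep the DAG online is majority voting: two members plus the witness hold three of five votes. The sketch below is a simplified model of that arithmetic; it assumes the witness vote counts only when the DAG has an even number of members and it ignores refinements such as dynamic quorum. `has_quorum` is an illustrative name.

```python
def has_quorum(members_up, total_members, witness_locked):
    """Simplified majority check for a DAG with an optional witness vote."""
    uses_witness = total_members % 2 == 0
    total_votes = total_members + (1 if uses_witness else 0)
    votes = members_up + (1 if uses_witness and witness_locked else 0)
    return votes > total_votes // 2


# Four-member DAG, primary datacenter lost, witness reachable from Portland.
print(has_quorum(members_up=2, total_members=4, witness_locked=True))
# Same failure, but the witness is unreachable too.
print(has_quorum(members_up=2, total_members=4, witness_locked=False))
```

This is why the witness belongs in a third location: if it sat in the failed datacenter, the surviving members would hold only two of five votes.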

Site Resilience Failover Examples

[Diagram: dag1 with mbx1, mbx2, and the witness in the primary datacenter (Redmond), and mbx3 and mbx4 in the alternate datacenter (Portland); failure markers (X) shown on the diagram]

Site Resilience Failover Examples

1. Mark the failed servers/site as down: Stop-DatabaseAvailabilityGroup DAG1 -ActiveDirectorySite:Redmond
2. Stop the Cluster service on the remaining DAG members: Stop-Service ClusSvc
3. Activate the DAG members in the 2nd datacenter: Restore-DatabaseAvailabilityGroup DAG1 -ActiveDirectorySite:Portland

[Diagram: dag1 with mbx1, mbx2, and the witness in the primary datacenter (Redmond, marked failed with X), and mbx3, mbx4, and the alternate witness in the alternate datacenter (Portland)]
