Every Exchange 2010 mailbox server runs Active Manager, a component inside the Microsoft Exchange Replication service (MSExchangeRepl) that acts as the high availability brain of Exchange 2010. Active Manager manages failovers and switchovers (collectively known as *overs) on DAG members by selecting the best database copy to activate. It also frequently checks the Active Directory topology on each mailbox server for changes.
Each DAG member runs Active Manager in one of two roles: Primary Active Manager (PAM) or Standby Active Manager (SAM). The node that owns the quorum runs the PAM; every other DAG member runs a SAM. One of Active Manager's functions is to monitor database and information store health. If a database fails on a DAG member, the SAM on that member informs the PAM so that it can take action.
The SAM also responds to queries from Client Access servers and Hub Transport servers about which server owns the active copy of a mailbox database. When a user connects to a CAS, the CAS asks the Active Manager on a DAG member where the database that holds the user's mailbox is currently active, and the SAM responds with the name of that server so the CAS can connect the user to it.
If the quorum owner fails, the Active Manager on the node that obtains the quorum becomes the PAM.
If an active mailbox database fails, and if the database is configured for replication, the PAM is notified to select the best database copy to be activated on another node. This is done as follows:
The Best Copy Selection (BCS) algorithm is run.
After selecting the best copy, the PAM notifies the selected server to become the next database master.
The Microsoft Exchange Replication Service on the selected server will try to copy the logs from the previous active database master through the Attempt Copy Last Logs (ACLL) process. ACLL will query other servers where there is a healthy database copy with the highest log generation number by checking the LastInspectedLogTime.
After ACLL completes, the PAM notifies the selected server to mount the database. If all the logs were copied successfully, the database mounts without any data loss. If some of the logs could not be copied, the database is mounted only if the number of missing logs (the copy queue length) is less than or equal to the limit implied by the AutoDatabaseMountDial parameter.
The AutoDatabaseMountDial parameter can be set to BestAvailability, GoodAvailability or Lossless. If the value is BestAvailability, the number of missing logs must be less than or equal to 12 for the database to be mounted automatically. If the value is GoodAvailability, which is the default in Exchange 2010, the number of missing logs must be less than or equal to 6. If the value is Lossless, all the logs must be copied to the selected server for the database to be mounted; in other words, the number of missing logs must be 0.
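The mount dial check described above can be modeled in a few lines. This is only an illustrative Python sketch, not Exchange code: the name `can_auto_mount` and the dictionary are hypothetical, and the limits are simply the three values quoted in the text.

```python
# Maximum number of missing log files (copy queue length) tolerated for an
# automatic mount, per AutoDatabaseMountDial value, as quoted in the text.
MOUNT_DIAL_LIMITS = {
    "BestAvailability": 12,
    "GoodAvailability": 6,
    "Lossless": 0,
}


def can_auto_mount(mount_dial: str, missing_logs: int) -> bool:
    """Hypothetical helper: True if the copy may be mounted automatically."""
    if missing_logs < 0:
        raise ValueError("missing_logs cannot be negative")
    # The comparison is inclusive: a copy that is exactly at the limit mounts.
    return missing_logs <= MOUNT_DIAL_LIMITS[mount_dial]


print(can_auto_mount("GoodAvailability", 5))  # True
print(can_auto_mount("GoodAvailability", 8))  # False
print(can_auto_mount("Lossless", 0))          # True
```

Keeping the limits in a lookup table makes the inclusive comparison explicit: a copy missing exactly 6 logs still mounts under GoodAvailability.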
If the database cannot be mounted on the selected server, the next candidate mailbox server returned by the BCS process is selected (if any). If there is no other mailbox server to select, the administrator has to mount the database manually and accept the data loss.
There are other reasons why the mailbox database might not be mounted on the selected server. A property can be configured on each DAG member to limit the number of databases that can be active on it at the same time. If the limit is reached, no other database copy can be activated or mounted on this server, and the PAM repeats the process of selecting the next database master. This limit is configured using the Set-MailboxServer cmdlet with the -MaximumActiveDatabases parameter.
Set-MailboxServer -Identity <MailBoxServer> -MaximumActiveDatabases <number>
The selected server will also not automatically activate and mount the database copy when automatic database activation is disabled on it, i.e. when the DatabaseCopyAutoActivationPolicy for the server is set to Blocked using the command:
Set-MailboxServer -Identity <MailBoxServer> -DatabaseCopyAutoActivationPolicy Blocked
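The two server-level blockers just described can be sketched as a simple check. This is a hedged Python model rather than anything Exchange exposes: the `MailboxServer` class and `activation_block_reason` function are hypothetical, and the fields merely mirror the two cmdlet parameters shown above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MailboxServer:
    """Hypothetical model of the server settings relevant to activation."""
    name: str
    active_databases: int
    # Mirrors -MaximumActiveDatabases; None means no limit is configured.
    maximum_active_databases: Optional[int] = None
    # Mirrors -DatabaseCopyAutoActivationPolicy.
    auto_activation_policy: str = "Unrestricted"


def activation_block_reason(server: MailboxServer) -> Optional[str]:
    """Return why automatic activation is refused, or None if it is allowed."""
    if server.auto_activation_policy == "Blocked":
        return "automatic activation is blocked on this server"
    limit = server.maximum_active_databases
    if limit is not None and server.active_databases >= limit:
        return "MaximumActiveDatabases limit reached"
    return None


full = MailboxServer("MBX-A", active_databases=2, maximum_active_databases=2)
blocked = MailboxServer("MBX-B", active_databases=0, auto_activation_policy="Blocked")
free = MailboxServer("MBX-C", active_databases=1, maximum_active_databases=4)

for server in (full, blocked, free):
    print(server.name, "->", activation_block_reason(server))
```

When the function returns a reason, the PAM would move on to the next candidate, as the text explains.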
Now that we have described what happens after the best copy is selected, we need to know how the PAM selects that copy. What is behind the BCS algorithm? This is what we will discuss next.
Best Copy Selection process
The BCS algorithm produces a list of database copies that are good candidates for activation. This list is then sorted by the following keys, in order:
Primary key:
- The lowest Copy Queue Length (CQL), i.e. the highest LastLogInspected
- The lowest Replay Queue Length (RQL)
- Content Index (CI) status (Healthy or Crawling)
Secondary key:
- The lowest Activation Preference (which you specify when adding a database copy to the mailbox database)
These selection criteria apply to both Exchange 2010 RTM and Service Pack 1. In Exchange 2010 SP1, however, if the AutoDatabaseMountDial parameter for the mailbox server is set to Lossless, the list is sorted with the Activation Preference as the primary key.
Based on these criteria there are ten possible sets of conditions, checked in this order:
1. ( CQL < 10 ) and ( RQL < 50 ) and ( CI is Healthy )
2. ( CQL < 10 ) and ( RQL < 50 ) and ( CI is Crawling )
3. ( RQL < 50 ) and ( CI is Healthy )
4. ( RQL < 50 ) and ( CI is Crawling )
5. ( RQL < 50 )
6. ( CQL < 10 ) and ( CI is Healthy )
7. ( CQL < 10 ) and ( CI is Crawling )
8. ( CI is Healthy )
9. ( CI is Crawling )
10. If none of the nine sets of criteria above is met by any database copy, the PAM tries to activate a database copy with a status of Healthy, DisconnectedAndHealthy, DisconnectedAndResynchronizing, or SeedingSource.
CQL - Copy Queue Length
RQL - Replay Queue Length
CI - Content Index
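To make the ordering of the ten criteria sets concrete, the following Python sketch returns the number of the first set a database copy satisfies. It is only an illustration of the list above: the function name `criteria_set` is hypothetical, and the tenth case is reduced to a fallback value because it depends on the copy status rather than on CQL, RQL and CI.

```python
def criteria_set(cql: int, rql: int, ci: str) -> int:
    """Return the number (1-10) of the first criteria set the copy matches."""
    low_cql = cql < 10
    low_rql = rql < 50
    healthy = ci == "Healthy"
    crawling = ci == "Crawling"

    # Same order as the list in the text; the first match wins.
    checks = [
        low_cql and low_rql and healthy,   # 1
        low_cql and low_rql and crawling,  # 2
        low_rql and healthy,               # 3
        low_rql and crawling,              # 4
        low_rql,                           # 5
        low_cql and healthy,               # 6
        low_cql and crawling,              # 7
        healthy,                           # 8
        crawling,                          # 9
    ]
    for number, matched in enumerate(checks, start=1):
        if matched:
            return number
    # Set 10: nothing matched, so only the copy status can still qualify it.
    return 10


# The queue lengths below are made-up sample values.
print(criteria_set(cql=5, rql=20, ci="Crawling"))  # 2
print(criteria_set(cql=8, rql=20, ci="Healthy"))   # 1
print(criteria_set(cql=5, rql=80, ci="Healthy"))   # 6
```

A lower number means a better candidate, so sorting copies by this value gives the order in which the PAM would try them.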
Example
In our example we have a DAG with four members (Ex14MBx1 through Ex14MBx4). The mailbox database Main-DB01 has a copy on each of these four members. The status of the database copies is as shown in the following figure:
Main-DB01 is currently active on mailbox server Ex14MBx1. Suppose Ex14MBx1 fails and the content index status of the database copies on the other three mailbox servers (as reported by the Get-MailboxDatabaseCopyStatus command) is:
Also let's assume that the AutoDatabaseMountDial for all servers is set to GoodAvailability (copy queue length must be less than or equal to 6), and server Ex14MBx2 is configured to host no more than two active mailbox databases at the same time (MaximumActiveDatabases is set to 2).
Now if server Ex14MBx1 fails, the PAM, after being notified, will start the BCS process:
Because the AutoDatabaseMountDial is not set to Lossless, a list of servers ordered by lowest copy queue length is created. The list will be: Ex14MBx2, Ex14MBx4, Ex14MBx3
The list is then sorted based on the criteria described above:
- Ex14MBx2: (CQL<10), (RQL<50) and (CI is Crawling) - matches criteria set 2
- Ex14MBx3: (CQL<10), (RQL<50) and (CI is Healthy) - matches criteria set 1
- Ex14MBx4: (CQL<10), (RQL>50) and (CI is Healthy) - matches criteria set 6
So the resulting list is:
- Ex14MBx3
- Ex14MBx2
- Ex14MBx4
ACLL will now try to copy the missing log files from the previous database master, Ex14MBx1, to the first server in the list, Ex14MBx3. However, Ex14MBx1 is down and cannot be contacted. Furthermore, because the AutoDatabaseMountDial is set to GoodAvailability (copy queue length must be less than or equal to 6) and the actual copy queue length on Ex14MBx3 is 8 (see the screenshot above), Ex14MBx3 will not activate Main-DB01.
The next server in the list, Ex14MBx2, is then notified. Its copy queue length is 5, so the mount dial would allow the database to be mounted there. However, Ex14MBx2 is limited to two active databases (MaximumActiveDatabases = 2) and, in this scenario, already hosts that many, so Main-DB01 will not be activated on this server.
Server Ex14MBx4 is notified next. Its copy queue length is 5 and its content index status is Healthy. ACLL notifies the PAM that the process succeeded on Ex14MBx4, and the PAM notifies Ex14MBx4 to mount and activate the mailbox database Main-DB01.
If none of the database copies can be mounted, the administrator has to manually activate a database copy on one of the servers and accept the data loss.
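The walk through the candidate list can be replayed with a short Python sketch. It is a simplified model, not the actual Active Manager logic: the function `pick_server` is hypothetical, the criteria set numbers and copy queue lengths are the ones given in this example, and the active database counts (two on Ex14MBx2, none on the others) are assumptions made for illustration.

```python
COPY_QUEUE_LIMIT = 6  # GoodAvailability, as assumed in the example

# (server, criteria set matched, copy queue length,
#  active databases [assumed], MaximumActiveDatabases or None)
candidates = [
    ("Ex14MBx2", 2, 5, 2, 2),
    ("Ex14MBx3", 1, 8, 0, None),
    ("Ex14MBx4", 6, 5, 0, None),
]


def pick_server(candidates, copy_queue_limit):
    """Try candidates in criteria-set order; return the first that can mount."""
    for name, _criteria, cql, active, max_active in sorted(
        candidates, key=lambda c: c[1]
    ):
        if cql > copy_queue_limit:
            print(f"{name}: skipped, copy queue length {cql} > {copy_queue_limit}")
            continue
        if max_active is not None and active >= max_active:
            print(f"{name}: skipped, already hosts {active} active databases")
            continue
        print(f"{name}: selected to mount the database")
        return name
    return None  # no copy can be mounted automatically


selected = pick_server(candidates, COPY_QUEUE_LIMIT)
```

Run as written, the sketch skips Ex14MBx3 and Ex14MBx2 for the reasons given above and selects Ex14MBx4. A return value of `None` would correspond to the case where the administrator has to activate a copy manually.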
Summary
Today we looked at the role of Active Manager. Running on each Exchange 2010 mailbox server, it controls high availability and monitors Database Availability Groups. On DAG members it controls switchovers and failovers by selecting, based on a set of criteria, the best candidate server on which to activate a database copy. Finally, we walked through an example of how Active Manager selects the best database copy to mount and activate.