Tuesday, February 06, 2007

Using Decision Automation in Disaster Recovery

DM Review recently published my article “Using Decision Automation in Database Administration” (http://www.datawarehouse.com/article/?articleId=6876). Right after that, I got an interesting question from a reader asking about leveraging decision automation in disaster recovery (DR). Specifically, she wrote, “A good application I can think of for decision automation is disaster recovery. Can you outline a model for how one could go about implementing it?” I wanted to share my response with everyone. So here I go…

Once the right DR model is designed, automation in general is invaluable in rolling it out, maintaining it, and ensuring it works to spec (“spec” being the enterprise service-level requirements pertaining to uptime, failover, and performance). The one area decision automation is specifically suited for, however, is establishing a centralized command/control mechanism that allows failover to occur under the right circumstances and, in the process, deals with false alarms.
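
To make the false-alarm part concrete, here is a minimal sketch (mine, not from the article) of one common guard: declare an outage only when several independent health probes have failed repeatedly and a quorum of them agree. Every name and threshold below is an illustrative assumption, not a reference to any particular product.

```python
# Hypothetical false-alarm filter: an outage is declared only when a quorum of
# independent probes has each failed several times in a row. A single flapping
# probe never triggers failover on its own.

class OutageDetector:
    def __init__(self, probes, failure_threshold=3, quorum=2):
        # probes: names of independent health checks,
        # e.g. ["app-ping", "db-login", "wan-check"]
        self.consecutive_failures = {name: 0 for name in probes}
        self.failure_threshold = failure_threshold  # misses in a row per probe
        self.quorum = quorum                        # probes that must agree

    def record(self, probe, healthy):
        # A single healthy response resets that probe's failure streak.
        self.consecutive_failures[probe] = (
            0 if healthy else self.consecutive_failures[probe] + 1
        )

    def outage_confirmed(self):
        failing = sum(
            1 for count in self.consecutive_failures.values()
            if count >= self.failure_threshold
        )
        return failing >= self.quorum
```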

The foremost (and most natural) reaction of many IT administrators when they face an outage (even a simple disk-array failure, much less a true disaster) is panic. In that state, the last thing you want them to do is think on their feet to figure out whether to fail over to the DR system and, if so, go about making it happen manually. And if there are other unanticipated hiccups along the way (which, given Murphy's law, happen more often than not at that very moment), the usual result is a service-level violation or, worse, a failed DR initiative.


To better appreciate this, imagine a nation’s nuclear arsenal being subject to manual controls and individual whims rather than a fully automated, centralized command/control system. If a human being (be it the president, the prime minister, or your local fanatic dictator), or worse, multiple humans, had to make the launch call and then deploy manually, the system would be prone to mood swings and other assorted emotions and errors, resulting in the weapons being deployed prematurely, or not at all even when the situation called for it. Further, once the decision is made to deploy or not deploy, manual errors could delay or prevent the required outcome.

These are the situations where decision automation can come to the rescue. Decision automation can completely do away with the manual aspects of deciding whether or not to initiate failover and how (partial or full). It can coldly look at the facts it has accumulated and allow pre-determined business logic to figure out next steps and initiate an automated response sequence as required. Let's look at what that means.
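
To illustrate what “pre-determined business logic” might look like in practice, here is a bare-bones sketch of an ordered rule table evaluated against the facts the monitoring layer has gathered; the first match dictates the response. The field names and responses are invented for illustration and are not taken from any product.

```python
# Hypothetical rule table: evaluate accumulated facts against pre-determined
# conditions, in order, and let the first match pick the response sequence.

RULES = [
    ("single node lost, storage intact",
     lambda facts: facts["failed_nodes"] == 1 and not facts["storage_failed"],
     "relocate services to the surviving node"),
    ("shared storage lost",
     lambda facts: facts["storage_failed"],
     "initiate failover to the designated standby"),
    ("no confirmed failure",
     lambda facts: True,
     "log the alert; take no failover action"),
]

def decide(facts):
    for description, condition, response in RULES:
        if condition(facts):
            return description, response

# Example: monitoring has confirmed that the shared storage is gone.
print(decide({"failed_nodes": 2, "storage_failed": True}))
```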

Let's say the service level agreement requires a particular system or database to provide 99.999% uptime (five 9s). That means the database can incur at most roughly five minutes of unplanned downtime during the entire year.
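
For the record, the arithmetic behind that figure works out to a shade over five minutes:

```python
# Annual unplanned-downtime budget at 99.999% availability.
minutes_per_year = 365 * 24 * 60                  # 525,600
downtime_budget = minutes_per_year * (1 - 0.99999)
print(round(downtime_budget, 2))                  # ~5.26 minutes per year
```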

Let’s make the problem even more interesting. Let's pretend the company is somewhat cash-poor. Instead of spending gobs of money on geo-mirroring solutions like EMC SRDF and other hardware/network-level replication (or even “point-in-time copy” generators like EMC’s SnapView and TimeFinder or NetApp’s SnapVault), it has invested in a simple two-node Linux cluster to act as the primary system and two other SMP machines to host a couple of standby databases. The first standby database is kept 15 minutes behind the primary, and the second remains 30 minutes behind. (The additional delay in the latter case provides a longer window of time to identify and stop any user errors or logical corruption before they are propagated to the second standby.) The primary cluster is hosted in one location (say, Denver), and the two standby databases are hosted in geographically disparate data centers, with data propagation occurring over a WAN.
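
Just to keep the moving parts straight, the same topology could be captured as a small declarative description that the automation platform reads. The field names and labels below simply restate the example and are otherwise made up.

```python
# Hypothetical topology description for the example environment.
TOPOLOGY = {
    "primary": {
        "site": "Denver",
        "platform": "two-node Linux cluster",
        "role": "read-write",
    },
    "standby_1": {
        "site": "remote data center A",
        "platform": "single SMP host",
        "apply_lag_minutes": 15,   # kept 15 minutes behind the primary
    },
    "standby_2": {
        "site": "remote data center B",
        "platform": "single SMP host",
        "apply_lag_minutes": 30,   # extra lag to catch logical corruption in time
    },
}
```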

Given this scenario, at least two different failover options emerge:
- Failover from the primary cluster to the first standby
- Failover from the primary cluster to the second standby

The underlying mechanisms to initiate and carry out each option are different as well. Also, given that failover occurs from a two-node clustered environment to a single node, there may be a service-level impact that needs to be considered. But first things first!

Rolling out this DR infrastructure should be planned, standardized, centralized, and automated (not “scripted”, but automated via a robust run book automation platform). You may wonder: why the heck not just do it manually? Because building any DR infrastructure, no matter how simple or arcane, is never a one-time thing (not unlike most IT tasks…). Once failover occurs, failback has to follow eventually, and the failover/failback cycle has to be repeated multiple times during fire drills as well as during actual failure scenarios. Once failover to a standby database happens, the primary database has to be rebuilt and restored to become a standby itself; or, in the situation above, since the primary is a cluster (which presumably provides more horsepower, so the system runs at reduced capacity once it has failed over to a single node), failback has to happen to reinstate the cluster as the primary server. Across these frequent cycles of failover, rebuilding, and failback, you don't want people doing all of these tasks manually. Inconsistent quality of work (depending on which DBA is doing which task) lets inefficiencies creep in, resulting in a flawed DR infrastructure, and human errors can compromise the validity of the infrastructure altogether.
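
In run book terms, “automated, not scripted” means each failover, rebuild, or failback cycle is a named sequence of steps the platform executes and verifies in order, so the work comes out the same no matter which DBA kicks it off. A toy sketch of the idea, with every step name invented and the actions left as placeholders:

```python
# Hypothetical run book runner: execute a named sequence of steps in order,
# stopping (and alerting) if any step fails, so each cycle is repeatable.

def run_playbook(name, steps):
    print(f">> starting run book: {name}")
    for step_name, action in steps:
        print(f"   step: {step_name}")
        action()  # each action is expected to raise on failure
    print(f">> completed: {name}")

# Illustrative failback cycle for the example above (placeholder actions).
FAILBACK_TO_CLUSTER = [
    ("rebuild the former primary cluster as a standby", lambda: None),
    ("resynchronize it from the current primary", lambda: None),
    ("switch roles back during a planned window", lambda: None),
    ("verify the cluster is primary and both standbys are applying logs", lambda: None),
]

run_playbook("failback to the Denver cluster", FAILBACK_TO_CLUSTER)
```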

So the entire process of building, testing, maintaining, and auditing the DR infrastructure needs to be standardized and automated. The “maintenance” part includes extracting the archived/transaction logs every 15 minutes, propagating them to the first standby server, and applying them there. Similarly, in the case of the second standby, the logs have to be applied every 30 minutes so that its fixed latency is maintained. Certain versions of certain database platforms (namely, Oracle and DB2) provide DBMS-supplied options to accomplish this. Regardless, the right mechanisms and components have to be chosen and deployed in a standardized and automated manner. Period.
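
The maintenance piece, holding each standby at its fixed lag, is itself just a schedule the automation enforces. A bare-bones sketch follows, where ship_logs() and apply_logs() are stand-ins for whatever DBMS-specific (or vendor-supplied) mechanism is actually used:

```python
import time

# Hypothetical log-shipping loop that keeps each standby at its fixed lag.
# ship_logs()/apply_logs() are placeholders, not real DBMS commands.

def ship_logs(standby):
    print(f"shipping archived logs to {standby}")

def apply_logs(standby, lag_minutes):
    print(f"applying logs on {standby}, keeping it {lag_minutes} minutes behind")

SCHEDULE = [
    # (standby name, run interval in minutes, apply lag in minutes)
    ("standby_1", 15, 15),
    ("standby_2", 30, 30),
]

def maintenance_loop():
    last_run = {name: 0.0 for name, _, _ in SCHEDULE}
    while True:
        now = time.time()
        for name, interval, lag in SCHEDULE:
            if now - last_run[name] >= interval * 60:
                ship_logs(name)
                apply_logs(name, lag_minutes=lag)
                last_run[name] = now
        time.sleep(30)  # coarse polling is plenty at these intervals
```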

Now, the DR infrastructure could be well deployed, tested, and humming, but at the critical moment of failover, it still has to be evaluated what kind of failure is being experienced, what the impact to the business is, how best to contain the problem symptoms, how to transition to a safe system, and how to deal with the impact on performance levels. Applying that to our example, it needs to be determined (the decision flow is sketched after the list below):
- Is the problem contained in a single node of the cluster? To use an Oracle example, if it’s an instance crash, then services (including existing connections in some cases, such as via Transparent Application Failover, or TAF) can be smoothly migrated to the other cluster node.

- If the database itself is unavailable because the shared file system has crashed (in spite of its RAID configuration), then the cluster has to be abandoned and services need to be transferred to the first standby (the one that’s 15 minutes behind). As part of the transition, any newer archived log files that haven’t yet been applied need to be copied to and applied on the first standby. If the parts of the file system that hold the online redo logs are not impacted, those logs need to be archived as well to extract the most current transactions, then copied over and applied. Once the standby is synchronized to the fullest extent possible, it needs to be brought up along with any related database and application services, such as the listener process. Applications then need to be rerouted to the first standby (which is now the primary database), either implicitly by reconfiguring the middleware or explicitly (by pointing to a different application configuration file/address, or even via lower-level mechanisms such as IP address takeover, wherein the standby server takes over the public IP address of the cluster).

- If, for any reason, the first standby is not reliable (say, logical corruption has spread to that server as well, or it is not starting up as expected due to other problems), the decision needs to be made to go to the second standby instead, carry out the process described above, and bring up the necessary services.
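
Pulling those branches together, here is a skeletal orchestration of that decision flow. Every function is a placeholder for a step the automation platform would execute and verify; none of it corresponds to a real vendor API, and the failure categories simply mirror the cases above.

```python
# Hypothetical end-to-end failover decision flow for the example environment.
# All functions are placeholders for automated, verifiable steps.

def assess_failure():
    ...  # returns "instance_crash", "shared_storage_lost", or "none"

def migrate_services_within_cluster():
    ...  # relocate services (and, where possible, connections) to the other node

def standby_is_healthy(standby):
    ...  # e.g. reject a standby that has inherited logical corruption

def synchronize_standby(standby):
    ...  # copy and apply unshipped archived logs; salvage and apply the online
         # redo if that part of the file system survived

def activate(standby):
    ...  # open the standby as primary, start the listener and related services

def reroute_applications(standby):
    ...  # repoint middleware, swap the application config, or take over the
         # cluster's public IP address

def fail_over():
    failure = assess_failure()
    if failure == "instance_crash":
        migrate_services_within_cluster()   # problem contained to one node
        return "cluster"
    if failure == "shared_storage_lost":
        # Prefer the 15-minute standby; fall back to the 30-minute one.
        for standby in ("standby_1", "standby_2"):
            if standby_is_healthy(standby):
                synchronize_standby(standby)
                activate(standby)
                reroute_applications(standby)
                return standby
    return "escalate"   # nothing matched, or no standby was usable
```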

The real challenge is not just having to make all these decisions on the fly based on the type and scope of the failure and the state of the various primary and standby systems; it is having to make them and complete the entire failover process within the required time (five minutes, in our example). Organizations cannot realistically expect a panic-stricken human being to carry all of this out quickly on the fly (yet, ironically, that’s exactly what happens 80% of the time!).

IT professionals sometimes say, “I don’t trust automation; I need to be able to do things myself to be able to trust it.” Well, would YOU entrust such a delicate process, fraught with constant analyzing and re-analyzing and ad hoc decision making, to a human being who may be sick, or on vacation and out of the office? Someone who may have left the company by then? Would you be content merely documenting the process and hoping that someone else (other than the person who designed and built the solution) on the IT team can perform it without problems? I, for one, wouldn’t.

The situations in which decision automation can be applied are manifold, and implementing a centralized command/control system for initiating failover in the DR process just happens to be an ideal candidate.