Disaster Recovery Plan Analysis
By: Mark Bole © 1992, 2004 BIN Computing
Disaster: a sudden or great misfortune; unforeseen mischance bringing with it destruction of life or property.
To plan for disaster recovery first requires identification of possible disasters. The table below summarizes the types of disaster that threaten us.
Following is an analysis of the recovery actions to be performed for each type of disaster situation.
Disaster Recovery
In any disaster, there are three main functions to be performed: communication, coordination, and recovery actions.
The first two functions apply generically to every situation; the individual recovery actions for each type of disaster are then described.
Discussion of the overall approach: some disaster recovery planning methodologies depend on extremely detailed write-ups of the steps to follow in a given situation. While superficially reassuring, these large documents are in practice costly to maintain, are not necessarily available when and where they are needed in an emergency, and in any event still require live, hands-on training! Given the highly technical and unique nature of every computer application, and APPLICATION in particular, a strong investment in training, rather than rote procedure development, is the approach preferred here. In other words, we are better off with a significant number of staff who can analyze and creatively respond to any situation than with a thick document which describes every situation except the one currently at hand.
Communication
First and foremost, the APPLICATION system operations staff needs to keep one another informed of current plans, developments, etc., to avoid efforts that are redundant or even at cross-purposes. Next, other APPLICATION support personnel, plus THE COMPANY network support personnel (if applicable), need to be kept informed. Finally, non-technical management needs to be apprised of the high-level decisions to be made. Methods of communication include face-to-face conversation, telephone, electronic mail, commercial electronic mail (outside THE COMPANY), fax, etc.
Plan: at present, a partial list of essential contact information (home phone numbers, for example) has been assembled. A fuller list, including home addresses, email IDs, and remote "rendezvous points" (either physical or reachable by telecommunication), will be assembled. Commercial email accounts will be obtained. Staff will be reminded regularly to give communications a high priority during any disaster situation.
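As a purely illustrative sketch (not part of the current plan), a machine-readable roster such as the hypothetical Python example below could hold this contact information alongside printed copies; every name, number, and address shown is a placeholder, not real staff data.

    # Hypothetical contact roster for disaster communication (placeholders only).
    ROSTER = [
        {
            "name": "A. Example",
            "home_phone": "555-0101",
            "home_address": "123 Example St.",
            "company_email": "aexample@company.example",
            "commercial_email": "aexample@mail.example",  # account outside THE COMPANY
            "rendezvous_points": ["Building 2 parking lot", "dial-in bridge 555-0199"],
        },
    ]

    def print_call_sheet(roster=ROSTER):
        # Produce a one-page call sheet suitable for keeping on paper at home.
        for person in roster:
            print(f"{person['name']}: {person['home_phone']} / {person['commercial_email']}")
            for point in person["rendezvous_points"]:
                print(f"    rendezvous: {point}")

    if __name__ == "__main__":
        print_call_sheet()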
Coordination
In a disaster, it will become necessary very quickly to determine who is "in charge" for the purpose of making risky technical decisions on short notice. In a crisis, there can be honest differences of opinion; however, to make quick progress, participative decision making may need to temporarily yield to more authoritative styles.
Plan: a "call-out" list will be developed, ranking technical support individuals in order of seniority for the purposes of urgent disaster recovery. A procedure will be designed so that any individual, not knowing whether others are available or even aware that a disaster exists, can determine within a reasonable time who is the disaster recovery coordinator in charge. (See the illustrative sketch below.)
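As an illustration only (the actual procedure remains to be designed), the Python sketch below shows one way the call-out list could be walked: the first person confirmed reachable acts as coordinator. The list entries and the is_reachable check are hypothetical placeholders.

    # Hypothetical call-out sketch: walk the seniority-ranked list; the first
    # person confirmed reachable becomes the acting disaster recovery coordinator.
    CALL_OUT_LIST = ["most senior admin", "second admin", "third admin"]  # placeholders

    def is_reachable(person: str) -> bool:
        # Placeholder for a real contact attempt (home phone, commercial email,
        # rendezvous point); always False here so the sketch runs as-is.
        return False

    def acting_coordinator(call_out_list=CALL_OUT_LIST) -> str:
        for person in call_out_list:
            if is_reachable(person):
                return person
        # If no one more senior can be reached, whoever ran this check assumes
        # the coordinator role until relieved by someone higher on the list.
        return "self (first staff member aware of the disaster)"

    if __name__ == "__main__":
        print("Acting coordinator:", acting_coordinator())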
Elements subject to disaster (the "Affects" column): server hardware; network connectivity; application software and data.

Types of Disasters

Each entry below gives the disaster,[1] its effect on each of the three elements above, and the expected result.[2]

I. Earthquake, flood, hurricane, war, riot, insurrection, epidemic, labor strike, power failure.
   - Server hardware: access blocked; possible physical damage.
   - Network connectivity: unreliable connectivity.
   - Application software and data: machine-readable media (tapes) possibly destroyed or inaccessible.
   - Expected result: some or all users cannot use APPLICATION. Loss of capital equipment. Estimated duration: between 4 hours and 2 weeks.

II. Fire, vandalism, accidental damage, mechanical failure.
   - Server hardware: possible physical damage.
   - Network connectivity: unreliable connectivity.
   - Application software and data: machine-readable media (tapes) possibly destroyed or inaccessible.
   - Expected result: some or all users cannot use APPLICATION. Loss of capital equipment. Estimated duration: between 4 hours and 4 days.

III. "Hackers" with malicious intent, employee alteration or theft of data for personal gain or vengeance, or accidental software bugs (including failures in interface with other systems).
   - Server hardware: unaffected.
   - Network connectivity: unreliable connectivity.
   - Application software and data: inaccurate or missing business data; system performance (response time) degradation.
   - Expected result: financial loss due to bad decisions based on faulty information, unfair advantage for competitors, missed payroll, or reduction in employee productivity. Estimated duration: one minute to several years.

IV. Unavailability of key personnel due to illness, injury, death, kidnapping, or termination of employment.
   - Server hardware: delay in maintenance and upgrades.
   - Network connectivity: delay in maintenance and upgrades.
   - Application software and data: delay in maintenance and upgrades.
   - Expected result: no immediate effect; increasing chance of financial loss and/or unavailability of APPLICATION depending on duration. Estimated duration: 1 day to 2 weeks.

V. Legal, regulatory, or organizational prohibition from using APPLICATION.
   - Server hardware: unaffected.
   - Network connectivity: unaffected.
   - Application software and data: partial or complete lack of access to APPLICATION.
   - Expected result: all users cannot use some or all of APPLICATION.
Recovery actions
Type I disaster: the most obvious and best recovery from the first type of disaster involves purchasing and maintaining an "off-site" server capable of running a full or "stripped" copy of APPLICATION. There is no resource available to create a "stripped" copy of our application; therefore, an off-site machine with the same capacity as our production system is the only viable option. Such a machine would cost several hundred thousand dollars to purchase, plus a significant fraction of an FTE to maintain in a ready-to-use condition.
Plan: in the event of a less-severe Type I disaster, there are a few options. First, off-site (Bay Area only) backup tapes are currently maintained by a commercial data storage service. These can be used if and when replacement equipment is available to restore the system to an operational state. Instructions for accessing these tapes will be made part of the communication plan (see above). Next, negotiations will proceed with our two primary hardware vendors to provide an "emergency spare" type of service; in other words, for a premium payment they will agree to store, off-site, dedicated replacement hardware compatible with our system. This is similar to purchasing an off-site server (as described above), but is done using expense dollars rather than capital dollars.
To make this plan work in any case requires a great degree of "depth on the bench" for the system operations staff. To facilitate training for all key personnel, drills involving partial and full system recovery to a machine other than our production servers will be designed and carried out.
Finally, although of limited use (and a potential security risk), telephone dial-in access can be provided directly into the production servers, possibly allowing login access even when physical or network access is unavailable. While this will not meet the needs of APPLICATION users, it may speed eventual recovery onto another system.
Type II disaster: this is very similar to a Type I disaster, except that the chances of rapid recovery are greater. Also, it may be possible (subject to an economic analysis) to keep several spare disk drives off-site (as opposed to an entire server), the idea being that disk drives are one of the most likely sources of hardware failure.
Plan: identical to Type I disaster recovery plan.
Type III disaster: this type of disaster is both easier to handle and harder to detect than the first two. The worst case is an intrusion into the system software that goes undetected for a long period of time, because even when it is eventually discovered, it is likely to be impossible to identify, let alone recover, whatever data may have been lost or corrupted. However, once a software "bug" (accidental or deliberate) is clearly identified, it is normally a fairly straightforward matter to fix it.
This type of disaster also lends itself to a number of prophylactic measures. Already in place in APPLICATION are both management and technical means for keeping users honest and accountable, and for detecting and correcting software bugs (including interface failures with other systems). The management approach involves having each employee and their immediate supervisor sign an acknowledgement of their responsibility to treat the application and data as a valuable corporate asset; the technical approach involves such standard activities as requiring password changes, disabling unused accounts, keeping users from accessing parts of the system they have no need to access, etc. Regarding accidental bugs, system test and other "quality assurance" procedures are in place to catch these. Incremental backups, which allow the system to be restored to its state at many different points in the past, also provide flexibility to restore data which has been corrupted.
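To make the point-in-time restore idea concrete, here is a minimal Python sketch, assuming a hypothetical catalog of full and incremental backups (the labels and dates are illustrative, not our actual tape inventory): it selects the most recent full backup at or before the target time, plus every later incremental up to that time.

    # Illustrative sketch: build the restore chain for a point-in-time recovery.
    from datetime import datetime

    BACKUPS = [  # (label, timestamp, kind) -- placeholder catalog, not real tapes
        ("full-001", datetime(2004, 5, 2), "full"),
        ("incr-002", datetime(2004, 5, 3), "incremental"),
        ("incr-003", datetime(2004, 5, 4), "incremental"),
        ("full-004", datetime(2004, 5, 9), "full"),
        ("incr-005", datetime(2004, 5, 10), "incremental"),
    ]

    def restore_chain(target: datetime):
        # Most recent full backup at or before the target, then every
        # incremental taken after that full backup up to the target.
        candidates = [b for b in BACKUPS if b[1] <= target]
        fulls = [b for b in candidates if b[2] == "full"]
        if not fulls:
            raise ValueError("no full backup exists at or before the target time")
        base = max(fulls, key=lambda b: b[1])
        increments = [b for b in candidates if b[2] == "incremental" and b[1] > base[1]]
        return [base] + sorted(increments, key=lambda b: b[1])

    # Example: restore the system to its state as of May 10, 2004.
    for label, stamp, kind in restore_chain(datetime(2004, 5, 10)):
        print(kind, label, stamp.date())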
Plan: continue to enforce and expand managerial and technical controls on who has which types of access to APPLICATION. Continue to improve system testing and other quality assurance procedures. Management input will be required to determine how far to go; in other words, application access security can be tightened to the point of inconvenience to legitimate users.
Type IV disaster: straying too far from industry-standard software and hardware configurations can create excessive vulnerability in this area. Failure to adequately compensate staff is another concern.
Plan: create "system documentation" (not the same as the less valuable "disaster recovery manuals" referred to in the "overall approach" section above) to show the general data relationships and flows of our system. Continue to improve the organization and storage of key source code on the system. Spend adequate time on cross-training (ideally, this can be 20% of total productive time for a particular group). Create liaisons to other key technical support groups both inside and outside of THE COMPANY, for mutual assistance and advice. Provide adequate career paths for staff. Note: these activities have historically received little management support, and cannot be carried out unless sufficient time is allowed.
Type V disaster: although the risk may seem generally low, THE COMPANY, as a "deep pockets" organization, is a special target for this type of disaster. For example, special legal steps were required not long ago to keep us from being sued for breach of contract by a software outfit whose product was originally purchased at the beginning of the APPLICATION project. Also, vendors may at any time choose to aggressively audit our licensing agreements, looking for violations.
Plan: maintain management vigilance over software license agreements.
[1]This column is intended to encompass, in a general sense, virtually all types of "disaster". The word "preventable" is used in the sense that steps can be taken to lessen the likelihood of occurrence, as in "preventable accidents"; it does not indicate that the event can be completely avoided at any reasonable cost.
[2]This column can be used to make a management assessment of the potential cost of a disaster, to support decisions around self-insurance, calculated risk-taking, etc.