Build Recovery into the Design

Here in the United States it has been a powerfully active hurricane season. Helene did a lot of unexpected damage, and some areas were so completely devastated that recovery is going to be measured in years, just as with Hugo in 1989. Power, water, and upstream network connectivity were all impacted. This is a painful reminder, just like Katrina was, of the need for a proper disaster recovery plan.

Unfortunately, too often when I talk to others about their disaster recovery plans, especially plans specific to a particular application or system, I learn that the plan was developed late in the cycle. It’s as if disaster recovery (and business continuity) planning is the end-of-term research paper that the professor mentioned on the first day of class, but which most students don’t start working on until the last week before it’s due. That approach to research papers usually means the resulting product is not a student’s best work; the same is true when the recovery plan is left to the end.

When I think about disaster recovery, I’m concerned with the following:

  • Can we successfully recover the hardware, OS, and software to a fully-functional state within the Recovery Time Objective (RTO) for that particular system?
  • Are we able to recover any relevant data so that loss is within the Recovery Point Objective (RPO) for that particular system?
  • What does the data consistency look like with the systems that this particular system integrates with? [This one often gets forgotten.]
    • Do we know how to identify if there are data consistency issues?
    • Do we have mechanisms to address any data consistency issues discovered?
    • Can/should the system be utilized if data consistency issues are discovered and what would be the impact if we allow access to the system as we sort these issues?
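
The first two questions can be turned into a simple pass/fail check after a recovery drill. The following is a minimal sketch in Python, not a definitive implementation; the names `RecoveryObjectives`, `DrillResult`, and `evaluate_drill` are hypothetical, and it assumes the drill records when the outage began, when service was restored, and the timestamp of the newest data present after the restore.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryObjectives:
    """Targets for one system (hypothetical structure for illustration)."""
    rto: timedelta  # maximum tolerable time to restore service
    rpo: timedelta  # maximum tolerable window of lost data


@dataclass
class DrillResult:
    """Timestamps captured during a recovery drill (assumed to be recorded)."""
    outage_start: datetime          # when the system went down
    service_restored: datetime      # when it was fully functional again
    last_recovered_write: datetime  # newest data present after the restore


def evaluate_drill(objectives: RecoveryObjectives, drill: DrillResult) -> dict:
    """Compare measured downtime and data loss against the RTO and RPO."""
    downtime = drill.service_restored - drill.outage_start
    data_loss_window = drill.outage_start - drill.last_recovered_write
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= objectives.rto,
        "rpo_met": data_loss_window <= objectives.rpo,
    }
```

For example, with an RTO of four hours and an RPO of fifteen minutes, a drill that restores service in three and a half hours but loses the last twenty minutes of writes meets the RTO and misses the RPO.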

The first two can be handled later in the cycle as long as the organization has sufficient hardware, licensing, personnel, and time to implement. The last one, around data, is hard to address late in the cycle. It can be addressed, but usually our ability to minimize data inconsistency is limited, especially in comparison to the options we have earlier in the development cycle. Also, if this doesn’t get looked at until close to production, there’s not much time to test the ability to detect and correct data inconsistency issues. Oftentimes these things don’t get developed, much less tested. Questions around the business impact of data inconsistency may not even be asked until an actual disaster occurs. This should be surprising given that data is an organization’s “life blood,” but for most of us it’s not, because disaster recovery preparedness is not a feature that can be deployed in a traditional sense. It’s not a visible example of making progress toward a production delivery. As a result, it gets deprioritized until it’s unavoidable. Then it gets rushed through to get the check in the checkbox, meaning no one develops an optimal disaster recovery plan that also takes data consistency into account.
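
To make "detecting data inconsistency" concrete, here is a rough sketch in Python of one possible detection mechanism: comparing records in a recovered system against the same records held by a system it integrates with. It is an illustration under stated assumptions, not a prescribed method. The function names are hypothetical, and it assumes both systems can export their shared records as dictionaries keyed by a common identifier.

```python
import hashlib
import json


def row_fingerprint(row: dict) -> str:
    """Hash a record in a canonical form so field order does not matter."""
    canonical = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_inconsistencies(recovered: dict, partner: dict) -> dict:
    """Compare records from the recovered system with an integrated system.

    Both arguments map a shared record key to that record's fields
    (an assumption about how the data can be exported).
    """
    recovered_keys = set(recovered)
    partner_keys = set(partner)
    shared_keys = recovered_keys & partner_keys
    return {
        # Records the partner knows about that the restore did not bring back
        "missing_in_recovered": sorted(partner_keys - recovered_keys),
        # Records present after the restore that the partner never received
        "missing_in_partner": sorted(recovered_keys - partner_keys),
        # Records present in both systems whose contents differ
        "mismatched": sorted(
            key for key in shared_keys
            if row_fingerprint(recovered[key]) != row_fingerprint(partner[key])
        ),
    }
```

A report like this only covers detection. Deciding how to correct each category, and whether the system should be opened to users while mismatches remain, is still a business question that has to be answered ahead of time.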

The obvious solution is to start addressing questions around data consistency in a recovery solution early in the development cycle. However, this requires a great deal of organizational discipline and an expectation that, during cycles when this sort of planning and design is prioritized, there is a strong likelihood that the number of tangible features delivered or completed will drop. An organization can put architectural and software development lifecycle (SDLC) requirements in place to ensure that these questions are looked at earlier in the process, but if disaster recovery planning is still seen by the respective teams as a checkbox item, then the teams will seek to meet the “letter of the law” and not the “spirit” of it. Formal reviews, such as by an internal auditor, will help, but for this to be done correctly, the culture of the organization has to insist that recovery is built into the design of each system.
