High availability, disaster recovery, and business continuity often fail due to poor design. Here's how to do them right -- even in the cloud

In my post last week, I dug into some of the perils of moving data to the cloud without having your own plan for bringing your services back online in the event that the provider fails. Judging by some of the email responses, a lot of folks out there are shocked that Amazon.com could consider the loss of its customers’ data to be a relatively normal event. Some of that surprise may stem from a lack of understanding about how Amazon.com has designed the various services that live under the Amazon Web Services (AWS) umbrella. But I think much of it may be due to a more persistent industry-wide problem: widespread confusion around what high availability (HA), disaster recovery (DR), and business continuity (BC) really mean. These terms are thrown around all over the place, but are frequently misused or misunderstood.

Why does this matter? Because these terms have three very important things in common: mission-critical business applications, large amounts of money, and setting expectations with high-placed and often very nontechnical business stakeholders. And let me tell you from firsthand experience: if improperly managed, these ingredients can be a recipe for disaster all by themselves. Being very clear with yourself, management, and business stakeholders about how you’re spending money on HA, DR, and BC, and what that money is going to buy you, is one of the keys to a happy life in IT.

Setting the stage

To compare and contrast HA, DR, and BC, it helps to have some examples to build on. To that end, let’s imagine three different mission-critical services that we want to ensure are as available as possible. The first is an on-premise SQL server running the database for a line-of-business application.
The second is an on-premise SAN-attached virtualization host that contains a mess of different virtual machines, ranging from application servers to infrastructural servers like domain controllers and file servers. The third is a Linux-based Web server hosted on AWS. These three examples are all dramatically different from an infrastructural standpoint, and so are the approaches used to ensure that they’re up and ready for business. Although their HA, DR, and BC solutions are very similar in concept, they share very little in execution.

High availability

Simply put, HA is a means to reduce your exposure to disaster by increasing the number of failures that must occur to cause a disaster. HA does not and cannot ensure that disaster won’t strike. You can provide all of the redundancy you want, and I can still guarantee there will be failure vectors that can skip right on past every one of them and ruin your day (data corruption, bad software, power spikes, storms, fire, and so on). That is perhaps the most important lesson to learn about HA: It’s only giving you a lower probability of experiencing a disaster, not making you immune to one.

Applying HA to my three examples is fairly straightforward. There are certainly many different ways to go about it — all providing varying levels of high availability at equally variable cost, but these are some common approaches. In the case of the SQL server, there are several common failure vectors to protect against. Most servers today include provisions for redundant power supplies, error-correcting memory, and RAID arrays — all of which could be considered types of high availability in that they allow a component to fail without causing interruption to service. However, most servers won’t protect you against a main board failure or OS instability. Thus, you might implement a second server and configure transactional replication between the two, or take it a step further and implement shared storage (SAN) and full clustering.
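To put a number on that "lower probability, not immunity" point: if you assume (unrealistically) that component failures are independent, the math behind redundancy is simple. The failure rates below are invented for illustration:

```python
# Illustrative only: HA lowers the odds of a service-killing outage, it does
# not eliminate them. Assume (unrealistically) independent component failures.

def outage_probability(component_failure_prob: float, redundant_copies: int) -> float:
    """Probability that every redundant copy fails at once."""
    return component_failure_prob ** redundant_copies

# A single server with a 5 percent annual failure chance:
single = outage_probability(0.05, 1)   # 5 percent chance of an outage
# A two-node cluster, where both nodes must fail to cause an outage:
cluster = outage_probability(0.05, 2)  # roughly 0.25 percent

# Common-mode failures (corruption, bad software, fire) hit every copy at
# once, so real-world numbers are never this good.
```

The twenty-fold improvement is exactly what you are buying with the second node; the common-mode caveat is why HA alone is never the whole story.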
That gives you the ability to weather the failure of either host and significantly decreases your exposure. The same approach can be used with the virtualization host. By adding a second host, clustering it with the first, and attaching it to the same SAN, you’ve allowed either host to fail with only short-lived service interruption. However, that still leaves the SAN as a single point of failure. Due to cost, most enterprises won’t opt to deliver the same level of HA at the shared storage level (by implementing a second on-site SAN with synchronous data replication) — instead opting to rely on internal SAN HA components, such as redundant controllers and fabric switches.

Cloud-based HA is often a bit tougher to wrap your head around because the infrastructures that back different cloud services vary widely. If you’re using AWS, you know you already have a lot of HA built in simply because of how Amazon EC2 and EBS are constructed. If an Amazon EC2 compute node fails and you’re using EBS disk resources, your EC2 instance can be restarted on a different compute node without too much fanfare. Similarly, each block of EBS disk is replicated among multiple storage nodes in Amazon’s EBS network. So, Amazon has already done most of the common HA footwork for you.

Disaster recovery

Unlike HA, DR does not seek to make your services more available by avoiding the impact of a disaster, but instead allows you to recover from a disaster when it inevitably happens. The biggest differences between HA and DR from a nontechnical perspective are the amount of time it takes to recover and how stale the recovered data is — often expressed as RTO (recovery time objective) and RPO (recovery point objective). An HA event (such as a cluster node failing over) might introduce a minute’s worth of service interruption, whereas recovering a backup onto a new piece of hardware could take hours or even days.
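One way to make RPO concrete for stakeholders is to compute the worst-case data-loss window straight from the backup schedule. A small sketch, using invented schedules:

```python
# Worst-case RPO is the longest gap between protection events. Backup times
# are hours-of-day on a 24h clock; the schedules are invented for illustration.

def worst_case_rpo_hours(backup_times: list) -> float:
    """Largest gap, in hours, between consecutive backups across a day."""
    times = sorted(backup_times)
    gaps = [b - a for a, b in zip(times, times[1:])]
    gaps.append(24 - times[-1] + times[0])  # gap that wraps past midnight
    return max(gaps)

# Nightly full backup at 02:00 only: up to a full day of data at risk.
nightly_only = worst_case_rpo_hours([2])                          # 24 hours
# Layer on transaction log backups every 4 hours: a far shorter window.
with_log_shipping = worst_case_rpo_hours([2, 6, 10, 14, 18, 22])  # 4 hours
```

Putting the two numbers side by side is often all it takes to justify (or kill) the cost of the extra log-shipping infrastructure.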
DR’s job is not to ensure consistent uptime, but instead to ensure that you can recover your data in the event that it’s lost, inaccessible, or compromised. As with HA, approaches to DR vary widely. The approach you’ll decide to use for any given application depends on the RTO and RPO that you want to achieve and how much money you have. In the case of the example SQL server, your approach might be as simple as tossing a tape drive onto the server and doing nightly backups. As long as your tapes are stored somewhere secure (preferably off-site), you’re protected from most disasters that might strike. If you need a shorter RPO, you might layer on periodic transaction log backups to be shipped off to an on-site NAS or perhaps to cloud-based storage such as Amazon S3.

The same approach could be used for virtualization hosts. However, the presence of a SAN gives you a few more options that you could layer on top of traditional backup to achieve better RPO and RTO. For example, you could implement a second SAN, this time at a different site (or a different building if you have a campus), and configure them to replicate. However, this time you wouldn’t opt to use synchronous replication and would instead use multiple layers of asynchronous replication. This is an important distinction to make: with synchronous replication, data corruption would immediately spread to your second SAN. By choosing asynchronous replication, you get a wide array of recovery points to choose from. A DR SAN also doesn’t need to be configured with the same amount of resources as the primary SAN. Instead of using lots of high-speed online disk, you could opt to use fewer, large-capacity nearline disks.

DR in the cloud is conceptually similar to protecting on-premise equipment, and this is precisely what most of the users who were ill-prepared for the Amazon failure didn’t account for.
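Before moving to the cloud case, the recovery-point advantage of asynchronous replication is worth making concrete. This sketch, with invented timestamps, picks the newest replica copy that predates a corruption event:

```python
# With asynchronous replication you keep multiple recovery points; when
# corruption strikes, roll back to the newest copy taken before it hit.
from datetime import datetime

def last_clean_recovery_point(recovery_points, corruption_time):
    """Newest replica copy taken strictly before the corruption event,
    or None if every copy is already tainted."""
    clean = [rp for rp in recovery_points if rp < corruption_time]
    return max(clean, default=None)

# Invented timestamps: replica copies every 6 hours, corruption at 14:30.
points = [datetime(2011, 5, 1, h) for h in (0, 6, 12, 18)]
safe = last_clean_recovery_point(points, datetime(2011, 5, 1, 14, 30))
# safe is the 12:00 copy. With synchronous replication there is no such
# choice: the corrupt write lands on both SANs at the same instant.
```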
Backing up your EC2 instance might be as simple as taking periodic snapshots of the underlying EBS disk, thus copying it onto Amazon’s S3 storage service, which boasts far better data durability than EBS (though at significantly lower performance). In addition to that, you might also configure periodic backups of the EC2 instance down to local, on-premise storage to remove the DR plan’s dependence on Amazon entirely.

Business continuity

BC ties together the availability goals of HA with the recoverability goals of DR. A correctly conceived BC design will allow you to recover from a full-blown disaster with very limited downtime and zero data loss. It is by far the most involved and expensive approach to take, but many enterprises have concluded that their dependence upon their data has grown to such an extent that BC is too important not to pursue. BC almost always involves some form of site redundancy — allowing business to continue in the event that the primary data center is rendered unavailable for whatever reason.

A word of caution: Don’t forget the network. Once you start talking about having BC resources located at remote sites or in the cloud, you need to have easy ways to fail over to them. Whether that means using dynamic routing to allow whole swaths of your data center to suddenly “appear” at a remote site (without addressing changes) or reconfiguring clients to access the services at the remote site, the networking component of BC is a challenge that should not be overlooked.

BC for that example SQL database might look pretty similar to a combination of the HA and DR approaches I’ve suggested here, but with the introduction of compute and replicated storage resources at a remote site. You could extend what you’ve already done by implementing a third SQL server at a remote site (or in the cloud) that would also receive high-frequency data replications. That would allow fast recovery of the database in the event of a complete site failure.
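To flesh out the EBS snapshot idea mentioned above, here is a sketch of the retention logic such a backup script might use. The snapshot IDs, dates, and keep-seven-dailies/four-weeklies policy are all invented, and a real script would list and delete snapshots through the EC2 API rather than operate on a dict:

```python
# Retention sketch: keep the last seven daily snapshots plus any Sunday
# snapshot under four weeks old, and prune everything else.
from datetime import date

def snapshots_to_delete(snapshots, today):
    """Takes a map of snapshot ID -> date taken; returns the IDs to prune."""
    keep = set()
    for snap_id, taken in snapshots.items():
        age_days = (today - taken).days
        if age_days < 7:                              # recent dailies
            keep.add(snap_id)
        elif age_days < 28 and taken.weekday() == 6:  # Sunday weeklies
            keep.add(snap_id)
    return sorted(set(snapshots) - keep)

snaps = {
    "snap-08": date(2011, 5, 8),   # Sunday, 12 days old: kept as a weekly
    "snap-10": date(2011, 5, 10),  # Tuesday, 10 days old: pruned
    "snap-13": date(2011, 5, 13),  # Friday, 7 days old: pruned
    "snap-15": date(2011, 5, 15),  # 5 days old: kept as a daily
    "snap-19": date(2011, 5, 19),  # 1 day old: kept as a daily
}
print(snapshots_to_delete(snaps, date(2011, 5, 20)))  # ['snap-10', 'snap-13']
```

The thinning policy matters because snapshots accumulate silently, and S3 storage is billed by the gigabyte-month.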
BC for the SAN-based virtualization cluster is again similar to that for the stand-alone SQL server in that you’ll want redundant compute (server) capacity located at a remote site. As in the DR example, you’ll probably also want a second SAN running asynchronous replication. However, this time you’re going to want to locate the secondary instance at a different site and configure it with enough transactional performance to keep up with a full production workload — probably a mirror of the configuration at the primary site rather than a stripped-down, low-performance configuration.

A solid cloud-based business continuity design really requires a thorough understanding of how your cloud provider works. Using Amazon AWS as an example, you’d want a second EC2 instance, but this one should be located in a different AWS region from the first. So, if the first instance is in U.S.-East, you’d want the second to be at least in U.S.-West (if not one of the more expensive international regions). Then you’d do some scripting on the primary server to have it periodically ship incremental live-state backups to the secondary. In fact, you could even turn the secondary instance on before each replication and off afterward to save you some cash. In the event that the primary EC2 instance failed, Amazon’s Elastic IP assignment could be used to shift traffic to the backup without any users being the wiser.

Some might even question whether that approach takes things far enough — especially given that there has been at least one instance where a failure in one AWS availability zone hurt services at others. If you find that you’re not comfortable working within a single provider, you could always replicate your data to a completely different cloud provider or to on-premise hardware. However, that would involve designing an addressing redundancy system to replace Amazon’s Elastic IP (whether that’s simply modifying DNS or something more complicated).
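The failover decision behind that Elastic IP shift can be sketched as a simple health-check rule. The probe results below are simulated; in practice you would poll the primary over HTTP and call the EC2 API to reassociate the address:

```python
# Fail over only after several consecutive missed health checks, so a single
# dropped probe doesn't bounce the Elastic IP back and forth. The history
# lists are simulated probe results (True = primary answered).

def should_fail_over(health_history, threshold=3):
    """True once the last `threshold` checks have all failed."""
    if len(health_history) < threshold:
        return False
    return not any(health_history[-threshold:])

print(should_fail_over([True, True, False]))          # False: one blip
print(should_fail_over([True, False, False, False]))  # True: three misses
```

Requiring consecutive misses is a crude form of hysteresis; without it, a momentary network hiccup could trigger a failover that is itself a service interruption.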
Putting it all together

Whatever approach you end up using to satisfy your HA, DR, and BC requirements, make sure that both you and your stakeholders are using the correct terminology and understand what is actually being bought by the investments being made. Business stakeholders, no matter how nontechnical they are, should understand how quickly you’ll be able to recover from the entire range of failures that might occur and what it will cost to improve those numbers. The last thing you want to be dealing with in the midst of a disaster is an army of suits with overblown expectations of 100-percent uptime wondering why you didn’t manage to live up to them.

This article, “The ugly truth about disaster recovery,” originally appeared at InfoWorld.com. Read more of Matt Prigge’s Information Overload blog at InfoWorld.com.