Matt Prigge
Contributing Editor

Moving data to the cloud? Read this first

analysis
Jul 5, 2011
5 mins

Protecting cloud-based infrastructure from downtime and data loss requires new skills and new approaches

Today’s systems have become so complex, most IT practitioners expect failure. Everything fails, and we invest the time, energy, and capital to build backups, warm sites, and all manner of redundancy so that we can stand things back up when they inevitably stumble.

Cloud-based infrastructure services have taken the lessons learned about data protection and stood them on their head. Suddenly, instead of building a comprehensive, on-premise data protection mechanism, we’re tossing our data in the cloud, where the only thing we really have to show for it is a fancy SLA that says our cloud provider probably won’t lose our data. It’s not exactly awe-inspiring.

You only need to look at the fairly well-publicized Amazon Web Services failure from a few months ago to see the result. The AWS forums were packed full of livid EC2/EBS users who had experienced extended downtime or even lost data during the outage. Does this mean that the cloud (AWS or otherwise) is an unreliable piece of junk we should all avoid? Of course not.

What it does mean is that we have a lot to learn as we bridge the experience gap between on-premise boxes made of sheet metal and seemingly locationless service objects floating in the free space of the cloud.

Lesson No. 1: Forget the SLA

SLAs are great. Before you enter into a service contract for anything with anyone, you should subject the service-level agreement to intense scrutiny. The fine print will give you a lot of insight into how your provider will react in the event it fails to deliver on its reliability promises.

Next, forget you have an SLA. No matter what kind of paper it’s printed on, an SLA can’t get your data back for you if it’s lost. No matter how good it is, the refund will never, ever make up for that — just as your homeowner’s insurance can’t replace all of your family heirlooms should your house burn down. A solid SLA really just provides motivation for the service provider to avoid screwing up, not a guarantee that it won’t.

Lesson No. 2: Expect failure

No matter how good a given cloud provider’s internal redundancy is, you need to plan for it to fail. No redundant system, no matter how well engineered, can survive the wrong combination of failures. This applies to off-premise, cloud-based services just as it does to traditional on-premise IT infrastructures. I have seen underdesigned, on-premise backup systems fail catastrophically, and I’ve seen several different cloud-based systems fail to work as advertised.

The Amazon debacle stemmed from a short period of network disruption during an upgrade that caused a localized failure, which then cascaded into serious, systemic failure. To be sure, Amazon has promised it will correct the design deficiencies that allowed the initial problem to cascade, but no one can foresee every eventuality that might cause this kind of widespread failure.

Moreover, not every cloud service is designed to be failure-proof. Take Amazon’s EBS (Elastic Block Store), for example: it is stated to have an annual failure rate of 0.1 to 0.5 percent. That means if you field 1,000 EBS volumes, you should expect somewhere between 1 and 5 of them to fail in any given year. Those aren’t bad odds as far as disk resources go, but it’s clearly an eventuality worth planning for.
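To make that arithmetic concrete, here is a minimal Python sketch. It assumes each volume fails independently at a constant annual rate, which is a simplification (the outage described above shows that failures can be correlated), and the function names are illustrative rather than part of any AWS tooling.

```python
def expected_failures(volumes: int, annual_failure_rate: float) -> float:
    """Expected number of volumes lost in a year (volumes x rate)."""
    return volumes * annual_failure_rate


def prob_at_least_one_failure(volumes: int, annual_failure_rate: float) -> float:
    """Chance that at least one volume fails in a year.

    Assumes failures are independent, which real outages may violate.
    """
    return 1.0 - (1.0 - annual_failure_rate) ** volumes


# The stated range of 0.1 to 0.5 percent, written as fractions.
LOW_RATE, HIGH_RATE = 0.001, 0.005
FLEET_SIZE = 1000

for rate in (LOW_RATE, HIGH_RATE):
    lost = expected_failures(FLEET_SIZE, rate)
    risk = prob_at_least_one_failure(FLEET_SIZE, rate)
    print(f"rate={rate:.3f}  expected losses={lost:.1f}  P(any loss)={risk:.3f}")
```

Under these assumptions, the chance of losing at least one of 1,000 volumes in a year is roughly 63 percent at the low end of the range and above 99 percent at the high end. At that scale, a volume failure is something to schedule for, not something to hope against.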

Lesson No. 3: Understand the infrastructure — even if it isn’t yours

Here’s the rub with the cloud: Just because you aren’t tasked directly with operating the cloud infrastructure that runs your services doesn’t mean you don’t need to develop the skills to understand how it works. In fact, quite the opposite is true. One of the main reasons so many Amazon users were so badly affected by the EBS outage was that they didn’t fully understand what made Amazon’s services tick and how to use them appropriately, though whether that’s a result of a failure of comprehension or documentation is open to debate.

In Amazon’s case, that requires a thorough understanding of the data durability differences between Amazon EBS and Amazon S3 and what benefit locating redundant services in different availability zones might grant you. Furthermore, designing, scripting, and regularly testing a solid game plan for what you’ll do when failure does strike is critically important.
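One small piece of such a game plan can be scripted without touching the provider at all: checking whether every service actually has replicas in more than one availability zone. The Python sketch below works from a hand-maintained inventory. The service names and the inventory format are hypothetical, and in practice the list would be pulled from your provider’s API or your own configuration management.

```python
from collections import defaultdict


def single_zone_services(replicas):
    """Return the services whose replicas all sit in one availability zone.

    `replicas` is an iterable of (service_name, zone_name) pairs.
    """
    zones_by_service = defaultdict(set)
    for service, zone in replicas:
        zones_by_service[service].add(zone)
    return sorted(
        service for service, zones in zones_by_service.items() if len(zones) < 2
    )


# Hypothetical inventory: two replicas per service, but only "web"
# is actually spread across two zones.
inventory = [
    ("web", "us-east-1a"),
    ("web", "us-east-1b"),
    ("database", "us-east-1a"),
    ("database", "us-east-1a"),
]

print(single_zone_services(inventory))  # prints ['database']
```

A check like this only reports where redundancy is missing. It says nothing about whether failover works, which is why testing the plan regularly matters as much as writing it.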

Putting it all together

Cloud-based infrastructure can provide a wealth of scalability and agility benefits, but it does not free you from worries over where your data resides or how it’s being protected. In the end, that’s still up to you. Whether or not you plan to use Amazon’s service, take to heart the examples of both those who fared poorly in that failure and those who weathered the storm.

If it seems like too much to wrap your head around right now, don’t be afraid to dip a toe in the water and become better prepared. I don’t believe that cloud-based services will ever entirely replace on-premise infrastructure, but it’s clear that they are here to stay. And we’d better learn to adapt.

This article, “Moving data to the cloud? Read this first,” originally appeared at InfoWorld.com.