Deduplicating backup appliances have become very popular, but choosing the right one means looking beyond deduplication

Although disk-to-disk backup has become incredibly popular over the past few years, the vast majority of enterprises, large and small, still use the same tape backups they implemented years ago. As time goes on, however, more and more of these old-school backup implementations will reach a breaking point where either capacity or performance can't get the job done. When you realize that tape can't cut it any longer, you'll likely consider a disk-based backup appliance, available from many vendors, such as EMC Data Domain, Exagrid, and Quantum. But be careful when choosing an appliance: Most buyers focus on finding the most efficient deduplication engine, but that's only one difference to explore.

The deduplication engine gets IT's attention because the whole point of implementing dedupe is to shrink the amount of storage you need to hold your backups, both to save on physical storage costs and to gain longer on-disk retention times. But capacity efficiency is a relatively small issue in practice. Most of the significant operational differences come down to when in the backup cycle deduplication takes place and how scalability is achieved.

Inline vs. postprocess deduplication

When you get down to it, two dedupe methods predominate today: inline deduplication and postprocess deduplication (also known as dedupe at rest). Each approach has significant strengths and weaknesses.

In inline deduplication, data is deduplicated as it is backed up to the appliance.
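Conceptually, inline dedupe chunks the incoming backup stream, fingerprints each chunk, and writes only the chunks the appliance hasn't seen before. Here's a minimal sketch in Python using fixed-size chunks and SHA-256 fingerprints; real appliances typically use variable-size chunking and their own fingerprinting schemes, so treat this as an illustration of the idea, not any vendor's implementation:

```python
import hashlib

CHUNK_SIZE = 4096  # fixed-size chunks for simplicity; appliances often chunk variably


class DedupeStore:
    """Toy inline-deduplicating backup target: only unique chunks hit 'disk'."""

    def __init__(self):
        self.chunks = {}   # fingerprint -> chunk bytes (the back-end storage)
        self.backups = {}  # backup name -> list of fingerprints (the recipe)

    def backup(self, name, data):
        recipe = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            fp = hashlib.sha256(chunk).hexdigest()
            # Inline dedupe: fingerprint each chunk as it arrives in the stream,
            # and store its bytes only if this fingerprint is new.
            self.chunks.setdefault(fp, chunk)
            recipe.append(fp)
        self.backups[name] = recipe

    def restore(self, name):
        # "Rehydration": reassemble the original stream from stored chunks.
        return b"".join(self.chunks[fp] for fp in self.backups[name])


store = DedupeStore()
data = b"A" * 8192 + b"B" * 4096   # two identical chunks plus one unique chunk
store.backup("monday", data)
store.backup("tuesday", data)      # a second full backup adds no new chunks
assert store.restore("tuesday") == data
print(len(store.chunks))           # unique chunks actually stored -> 2
```

The second full backup of unchanged data writes no new chunks at all, which is exactly why dedupe makes long on-disk retention affordable; the restore path shows why rehydration costs extra work compared with reading a plain copy off disk.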
This approach results in the smallest amount of back-end storage usage, because only the deduplicated data stream is written to the appliance's disk. However, it limits the speed at which data can be saved, because the deduplication processing typically requires very large amounts of processor and memory capacity. That same processing has to happen in reverse when you restore the data: Instead of simply reading the data from disk, the appliance must also "rehydrate" the data into its original, un-deduplicated state. That, too, is processor-intensive.

In postprocess deduplication, data is written to the appliance as fast as the network and disks allow. Only once the data has been transferred to the backup storage does the appliance perform the deduplication. This task is also processor-intensive, but because it doesn't have to keep up with the flow of backups in real time, as inline deduplication does, it's faster overall. And there's usually no rehydration performance penalty during restores, as most postprocess appliances keep a non-deduplicated copy of the data.

Of course, the postprocess approach requires significantly more storage capacity at the backup end, nearly double in some cases. And there's a delay in your ability to replicate the backup to a second, mirrored appliance: You have to wait until the deduplication is complete, creating a window of vulnerability and perhaps undermining a stringent multisite backup availability SLA.

Scale-up vs. scale-out

As anyone managing storage resources today knows, data is growing at an almost unbelievable clip. Fantastic as it sounds, the oft-quoted IDC stat suggesting that corporate data doubles every 18 months hasn't been far off the mark in my own experience. No wonder so many organizations are outgrowing their tape backup systems. The scale of the data being backed up is a key factor as well.
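To see what that growth rate implies for capacity planning, here's a back-of-the-envelope sketch; the 50TB starting point and the device lifetimes are illustrative assumptions, not figures from any vendor or from IDC:

```python
# Rough backup-capacity planning, assuming data doubles every 18 months.
# The starting capacity and lifetimes below are illustrative assumptions.

GROWTH_PER_YEAR = 2 ** (12 / 18)  # ~1.59x annual growth if doubling every 18 months


def required_capacity_tb(initial_tb, lifetime_years):
    """Capacity the backup target must hold by the end of the device's life."""
    return initial_tb * GROWTH_PER_YEAR ** lifetime_years


initial = 50.0  # TB of backup data today (assumed)
for years in (3, 5):
    print(f"After {years} years: ~{required_capacity_tb(initial, years):.0f} TB")
# After 3 years: ~200 TB; after 5 years: ~504 TB
```

Even a modest miss in the assumed growth rate compounds sharply over a device's lifetime, which is what makes up-front sizing so difficult.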
For disk-to-disk backup appliances, there are two primary approaches to the scalability issue: scale-up and scale-out.

Users of traditional SAN implementations will be familiar with the scale-up approach, which typically pairs static controller/compute resources with a variable amount of storage. In these deployments, you can add capacity relatively cheaply and easily, both to lengthen retention times and to store your growing data pools. However, you have to consider carefully at the outset how to size your controller resources. As with scale-up SAN implementations, you have to estimate up front both the overall capacity and the performance requirements for the end of the device's expected lifetime, which is often difficult to do accurately in today's quickly changing IT landscape. Underestimating may force large, unexpected capital investments to upgrade the controllers sooner than planned; arguably worse, overbuying at the outset may mean retiring the device before its full performance potential is ever exercised.

The scale-out approach avoids some of these pitfalls, but it isn't without its own problems. In scale-out implementations, controller resources are generally paired with fixed storage resources, and scalability is achieved by changing the number of devices in a group as performance and storage requirements change. This handily avoids the need for accurate long-term planning, since each year's backup storage investments can instead be guided by short-term requirements. It also largely avoids the risk of substantial overbuying or underbuying. However, the fixed relationship between controller and storage resources can present a problem when you require more of one than the other. For example, you may want to provide extremely long retention for a relatively small amount of quickly changing data.
Doing that with a scale-out platform might require purchasing a large amount of controller resources just to get the required storage density, something much easier and cheaper to accomplish with a scale-up system. Some scale-out platforms also have scalability and management limitations that may make them inappropriate for very large enterprises dealing with truly enormous backup datasets.

How to pick the right approach

If tape is still your primary backup target and you're thinking of disk backup as a replacement, you should consider a deduplicating backup appliance. But don't get stuck on the vendors' cost and performance figures. Instead, be sure to consider the architectural differences and how they might affect you both today and several years down the line. There's no one right answer for everyone, but there will undoubtedly be a right answer for your organization, if you think through the deduplication processing and scalability issues up front.

This article, "Bye-bye, tapes: How to get the right disk backup appliance," originally appeared at InfoWorld.com. Read more of Matt Prigge's Information Overload blog and follow the latest developments in storage at InfoWorld.com.