By Matt Prigge, Contributing Editor

Review: ExaGrid aces disk-to-disk backup

Reviews | Jan 3, 2013 | 15 mins

ExaGrid's unique scale-out grid architecture makes for powerful, scalable, and uncomplicated disk-based backup and deduplication

For enterprises seeking to escape the challenges of managing and maintaining tape backup architectures, disk-to-disk backup has been nothing short of a godsend. By replacing tape with disk for nightly backups and relegating tape to a long-term archival role, organizations of all sizes can shrink backup windows and provide near-instantaneous restores. While simple direct-attached storage may fit the bill for smaller organizations, larger enterprises wrestling with the task of protecting terabytes of data find themselves looking for functionality that plain old disk can’t provide.

That’s where deduplicating backup appliances really shine. While there are a number of well-known vendors with very strong product offerings in this space (EMC Data Domain and Quantum, to name two), ExaGrid’s unique scale-out grid architecture and truly refreshing support model set it apart from the pack and place it in a class of its own.

To say that deduplication technology is “hot” is something of an understatement. With rapidly growing mountains of data, leveraging dedupe in backup (if not primary storage) has almost become a necessity. However, as sexy as deduplication tech may be, it’s reached a point where the major dedupe vendors are, by and large, getting the same data reduction results from their deduplication engines. Today the differences reside mainly in the impact the deduplication engine has on backup and restore performance and how well the solution scales as backup data inexorably grows. This is where ExaGrid has chosen to invest the bulk of its R&D.

Scale-out vs. scale-up

First, the ExaGrid EX series uses a scale-out grid architecture versus the scale-up architectures adopted by many of its rivals. That architecture allows you to combine multiple EX-series appliances — each equipped with dedupe and network capacity matched to its storage capacity — into a linearly scalable grid. This is important because it handily deals with the one true constant of any storage architecture today: rampant growth.

Because scale-up architectures are typically dependent on static controller resources that are sized when the system is initially purchased, an unexpected spate of growth might force you to replace those (often very expensive) controller resources well ahead of when you might have expected. With ExaGrid’s scale-out approach, you simply add another appliance to the grid and scale your storage capacity and backup performance at the same time. It’s about as close to pay-as-you-go as you’ll get this side of the cloud.

Inline vs. post-process

Second, on the deduplication front, ExaGrid's EX series uses a post-process deduplication model. This means that the backup data is written to the device in its fully "hydrated" form and is deduplicated after the backup process is complete. This is in contrast to the more popular inline deduplication model, which sees incoming data deduplicated as it is written to the device.

A few years ago, I would have unapologetically derided ExaGrid for taking the post-process approach. From a storage efficiency standpoint, it's clearly a poor choice: Storing the most recent backup in its native, undeduplicated form, then creating a separate deduplicated copy would seem to nearly double the amount of storage the appliance needs to do its job. No surprise, it typically does. That's neatly reflected in ExaGrid's model numbers, which imply a capacity half that of the device's actual usable capacity (the EX1000 and EX13000E have 2TB and 26TB of usable capacity, respectively).

If it’s bad for storage capacity, why use post-process dedupe versus the much more miserly inline dedupe? One reason is that deduplication requires a whole lot of compute and disk performance to do effectively. Realistically shooting for 20:1 dedupe (and beyond) means you must devote a large amount of compute resources to finding every bit of duplicate data and removing it. If you’re going to provide inline dedupe capability without the dedupe engine becoming a bottleneck to backup performance, you need to throw a lot of expensive, high-end compute hardware at the problem. From ExaGrid’s point of view, the cost of the additional storage required for post-process dedupe was easily offset by the cost of implementing the specialized compute performance necessary to do it inline.
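To see why dedupe is so compute-hungry, consider a toy block-level deduplication pass in Python. This is purely an illustrative sketch — the fixed-size blocks, in-memory index, and SHA-256 fingerprints are my assumptions, not ExaGrid's actual engine — but it shows the core cost: every byte of every backup must be chunked, hashed, and looked up before a single write can be skipped.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative fixed block size; real engines often chunk adaptively

def dedupe_stats(stream: bytes, index: dict) -> tuple[int, int]:
    """Split a backup stream into blocks, fingerprint each one, and
    count how many are new versus already present in the index."""
    new_blocks = dup_blocks = 0
    for off in range(0, len(stream), BLOCK_SIZE):
        block = stream[off:off + BLOCK_SIZE]
        digest = hashlib.sha256(block).digest()  # every byte gets hashed: the CPU cost
        if digest in index:
            dup_blocks += 1        # duplicate: store only a reference
        else:
            index[digest] = block  # unique: store the block itself
            new_blocks += 1
    return new_blocks, dup_blocks

# Two "nightly backups" that share three of their four blocks
index = {}
night1 = b"".join(bytes([i]) * BLOCK_SIZE for i in range(4))
night2 = night1[:3 * BLOCK_SIZE] + b"\xff" * BLOCK_SIZE
print(dedupe_stats(night1, index))  # (4, 0): everything is new
print(dedupe_stats(night2, index))  # (1, 3): only the changed block is stored
```

Doing this inline puts the hashing and index lookups directly in the backup data path; doing it post-process, as ExaGrid does, moves that cost off the critical path at the price of landing-zone storage.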

InfoWorld Scorecard: ExaGrid EX Series

  Interoperability (10%):     10.0
  Scalability (20%):          10.0
  Performance (20%):           9.0
  Value (10%):                 9.0
  Management (20%):            8.0
  Data deduplication (20%):    9.0
  Overall Score:               9.1

However, it turns out that's not the only upside to post-process dedupe. Most inline deduplication solutions store the first full backup as the "reference" copy. As subsequent backups are performed, the data shared between those backups and the original reference copy doesn't need to be stored; instead, it is referenced back to the original copy (a technique called backward-referencing). This is perfectly efficient from a backup perspective and actually serves to decrease the number of disk writes required. But what's good for backups is bad for restores. During a restore of the most recent backup (generally the one most people want when performing a restore), the appliance is forced to "rehydrate" that backup from its most deduplicated form — again placing a very heavy load on the controller resources.

By delaying the deduplication pass until after the backup is complete, ExaGrid can use an even more computationally costly dedupe methodology referred to as forward-referencing deduplication. Instead of storing the first backup as the reference copy, then only deduplicated differentials, ExaGrid ensures that the most recent backup is always in the least deduplicated form. Thus, a restore of data from the most recent backup ends up being far faster than a restore of an old backup (which is typically exactly what you want).
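The difference between the two referencing schemes can be sketched in a few lines of Python. This is a toy model, not ExaGrid's implementation — it assumes equal-length backups and simple block-level diffs purely for illustration — but it captures the key property: the newest backup is always stored whole, and only progressively older restores pay a rehydration cost.

```python
BLOCK = 4

def blocks(data: bytes) -> list[bytes]:
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]

class ForwardReferencingStore:
    """Toy forward-referencing store: the newest backup is kept whole,
    and each older backup is kept only as the blocks where it differed
    from the backup that replaced it. Assumes equal-length backups."""

    def __init__(self):
        self.latest = None  # newest backup, fully materialized
        self.deltas = []    # deltas[0] rebuilds the second-newest, and so on

    def ingest(self, backup: bytes):
        if self.latest is not None:
            old, new = blocks(self.latest), blocks(backup)
            # demote the previous "latest" to a sparse delta against the new one
            self.deltas.insert(0, {i: b for i, b in enumerate(old) if new[i] != b})
        self.latest = backup

    def restore(self, age: int) -> bytes:
        """age=0 is the newest backup (instant); larger ages walk the deltas."""
        cur = blocks(self.latest)
        for delta in self.deltas[:age]:
            for i, b in delta.items():
                cur[i] = b
        return b"".join(cur)

store = ForwardReferencingStore()
store.ingest(b"AAAABBBBCCCC")
store.ingest(b"AAAABBBBDDDD")   # last block changed
store.ingest(b"AAAAEEEEDDDD")   # middle block changed
print(store.restore(0))  # newest, no rehydration: b'AAAAEEEEDDDD'
print(store.restore(2))  # oldest, rebuilt from deltas: b'AAAABBBBCCCC'
```

A backward-referencing store inverts this: the oldest backup is the one stored whole, so it's the restore of last night's backup that has to walk the delta chain.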

While it’s true that the backward-referencing approach has very little impact on the performance of small restores, it can have a substantial impact on very large restores. As server virtualization grows almost ubiquitous, restore jobs involving multi-hundred-gigabyte virtual machine images are becoming much more common.

Additionally, some backup software platforms are able to leverage the backup appliance’s storage to start a virtual machine directly from the appliance and use the virtualization hypervisor’s storage migration functionality to copy the virtual machine back onto primary storage (Veeam’s Instant Recovery coupled with VMware Storage vMotion is a great example of this). While a great deal of attention has always been placed on shortening backup windows, accelerating restore windows is more important today than ever before. ExaGrid’s post-process approach to deduplication meshes perfectly with these heavy-duty use cases.

ExaGrid in the lab

ExaGrid EX appliances are remarkably simple from a hardware perspective. Essentially commodity servers with an ExaGrid badge on the front, they are currently available in eight different models that range in total usable capacity from 2TB in the EX1000 to 26TB in the EX13000E (the top three of which are also available in versions that include support for at-rest encryption). Beyond that, the only question you have to answer is whether or not to add 10Gbps Ethernet connectivity — an option available on the top five models of the range. As all of the models are sold with the correct balance of CPU, memory, network connectivity, and disk resources, no further options are really needed.

High-level architecture. ExaGrid appliances are organized in a site and grid hierarchy. Appliances running within a single grid are expected to have direct, low-latency Layer-2 network access to each other and effectively act as a single, unified system as it relates to backup retention. Grids consisting of one or more appliances located at different “sites” (whether or not they are truly different locations) can be configured to replicate deduplicated data to each other in a very flexible hub-spoke model.

All of the EX-series models ship with at least two physical gigabit network interfaces (the top-end EX13000E ships with six). In all cases, the first of these NICs is typically dedicated to intragrid traffic and is not used for backup traffic, even if you have only one appliance. Other NICs are used for actual host-to-appliance backup traffic. The number of available NICs in each model is roughly in line with the device’s capability to ingest backups. In the EX1000 I had for my lab testing, the single host-facing NIC, when saturated with backup traffic, is capable of moving just about as much traffic as the storage and compute hardware can handle. As you move up the product line, more host-facing NICs are added as the other capabilities of the appliances increase.

Setting up shares. After getting the device racked and attached to power, I ran through a very simple initial setup process that involved setting an IP address on the device (a task usually handled by ExaGrid support). Once I had access to the device's Web interface, I could get down to configuring a backup repository.

In ExaGrid parlance, those repositories are called shares. The shares are tailored around what backup application is going to be targeting them and which hosts are allowed to write to them. That first bit about application support is very important. Generally speaking, the ExaGrid supports NFS, CIFS, and Symantec’s OST protocols, but just because your backup app happens to be able to use one of those protocols doesn’t mean the ExaGrid will know how to treat the data once it’s written to the device.

In order to effectively interpret and deduplicate incoming backup streams, the ExaGrid has to know what format the data is going to take and what IP storage protocol to offer it up as. Fortunately, the list of currently supported applications is very comprehensive and growing all the time.

In my case, I wanted to emulate an environment using a typical combination of different backup software: Veeam’s Backup and Replication for backing up virtual machines, Symantec’s BackupExec 2010 for backing up physical machines, a direct Microsoft SQL Server maintenance plan backup share, and a TAR backup share that might be used to back up a typical Linux server. Creating those four shares only required that I give them a name, specify which kind of software I’d be using, and specify which source IPs were allowed to access the share — about as simple as you can get. After that, I followed the directions in the application-specific best practice manuals provided for each piece of software to get the backup servers attached to the ExaGrid.

In the end, I was slinging backups about an hour after opening the box.

Managing data. After configuring some backups, I got to see how the appliance handled backup data. As you may recall from the share-creation process, I was never asked to specify how much space each share was allowed to use. Instead, backup capacity is allocated dynamically — removing some management overhead.

By default, each ExaGrid appliance is configured to dedicate half of its usable storage to a so-called landing zone and half of the storage for deduplicated retention. The landing zone is where the initial raw application backups are directed without any deduplication taking place. By default, 10 minutes after the last backup file has been closed, the appliance will start its post-process deduplication sweep, which effectively copies that most recent backup in the landing zone into the least deduplicated reference backup. Any previous deduplicated backups are thinned out and left as deltas from that most recent backup.

Throughout this process, the natively formatted copy sitting in the landing zone is left untouched; it is still available to be restored without any “rehydration.” That landing-zone copy is overwritten only when the next backup pass takes place or if the retention store overflows the default 50-50 split. In all cases, whether you try to back up too much data into the landing zone or retain too much data in the retention store, the device will attempt to keep everything you throw at it. Unless you’re truly out of space, at worst you’ll be peppered with email warnings and a support incident will be created to investigate the cause.

For example, in cases where you might want to support very long retention times (many months or years), ExaGrid support can modify the default landing zone to retention store split to dedicate more space to retention and less to the landing zone. This is commonly done in situations where you might have an ExaGrid appliance located at a disaster-recovery site that will serve only to receive deduplicated backup replications from a device at the primary site. That DR device wouldn’t need a landing zone at all, so all of the space might be configured to be available for retention.
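A back-of-the-envelope model shows how that landing-zone/retention split translates into retained history. The numbers below — a 5TB nightly backup and the 20:1 dedupe ratio mentioned earlier — are illustrative assumptions, not ExaGrid sizing guidance:

```python
def capacity_model(usable_tb: float, landing_frac: float = 0.5,
                   nightly_backup_tb: float = 5.0, dedupe_ratio: float = 20.0):
    """Rough capacity model for a post-process dedupe appliance
    (illustrative only). Returns the landing-zone headroom and the
    approximate number of retained deduplicated backup versions."""
    landing = usable_tb * landing_frac      # raw, undeduplicated backups land here
    retention = usable_tb - landing         # deduplicated history lives here
    per_version = nightly_backup_tb / dedupe_ratio
    return landing, retention / per_version

landing, versions = capacity_model(26.0)    # EX13000E: 26TB usable, default 50-50 split
print(f"landing zone: {landing:.0f}TB, ~{versions:.0f} retained versions")
```

Shifting landing_frac toward zero, as support can do for a replication-only DR target, trades away landing-zone headroom for more retained versions.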

When configured in a multiple-appliance grid, each appliance contains the landing zones for a specific set of shares that you’ll assign manually as you create them. However, the deduplicated data retention store is spread across all of the appliances in the grid. There is some manual load balancing involved in ensuring all the appliances in the grid are handling a relatively similar amount of the backup load, but long-term retention is shared among all of them.

Where ExaGrid misses

The ExaGrid appliances deserve a great deal of praise for what they're able to do and the economy with which they get it done. However, nothing is perfect. I hit on a few shortcomings when working with the solution.

First, forget about link aggregation. On devices with more than one host-facing network interface, there isn’t currently support for building a link aggregation group across the available NICs. That means you must manually target your backup jobs to individual NICs on the appliance in order to load balance the traffic across them. (The only exception is when you use Symantec OST for backups. Symantec OST does appear to be capable of dynamically streaming data to different NICs.) Of course, you can avoid this issue entirely if you opt to use 10Gb Ethernet. A single 10Gb pipe would do away with any need to load balance 1Gb interfaces.

Second, the security model is simplistic. The only control you have over who can read from or write to a given share is the source IP address you designate when you configure the share. Even if you're using CIFS, which (unlike NFS v3) supports user-level authentication, the ExaGrid does not implement it. This makes it especially important to do a good job of protecting the network segment the ExaGrid lives on from potential threats elsewhere on the network.

Finally, backup speeds are limited by ExaGrid's isolated landing zones. One of the only real drawbacks to ExaGrid's grid architecture is that, while the retention stores are spread across all members of a grid, the landing zones are not. While your aggregate backups to the grid may be able to leverage the raw performance of the entire grid, a single backup will always be limited to the throughput of the appliance that it targets (and, due to the lack of NIC teaming, even a single network interface). With most backup applications, it is relatively easy to create multiple jobs that target different shares, but this adds the administrative overhead of managing them all and ensuring some degree of parity among them.

Transitioning the existing grid model into a true scale-out NAS, in which an entire grid of appliances appears as a single appliance to a backup application, is no easy task. (Few primary storage vendors with scale-out architectures have artfully solved that problem with protocols like CIFS and NFS, either.) That this is a limitation in ExaGrid's model is perfectly understandable. Nevertheless, it's worth noting that this is typically not a problem with monolithic scale-up implementations.

If you’re working with explosive growth in your backups and are currently managing between 5TB and 75TB of data, I’d highly recommend taking the ExaGrid for a spin. While neither the Web interface nor the commodity hardware will turn any heads, ExaGrid has turned out a solid stack of software that does one thing and does it very well. Better yet, ExaGrid’s support model — which dedicates a named support engineer who is fully familiar with the backup applications you use — is a huge asset that is largely unparalleled in this space.

Above all else, if you’re in the market for a new backup appliance, remember that deduplication isn’t everything. In a day when even Microsoft’s Windows Server 2012 boasts impressive data deduplication capabilities, dedupe on its own is hardly a distinctive feature. What makes a backup appliance stand out is its ability to scale gracefully without decreasing performance and to handle complex multisite replication topologies — two tasks ExaGrid’s EX-series appliances manage remarkably well.

List pricing: EX1000, $14,900 plus support; EX13000E, $59,596 plus support; a maximum grid of 10 EX13000E appliances, $484,848 plus support. One year of next-business-day support typically runs around 15 percent of product list price.

This article, “Review: ExaGrid aces disk-to-disk backup,” originally appeared at InfoWorld.com. Read Matt Prigge’s Information Overload blog and follow the latest developments in storage at InfoWorld.com. For the latest business technology news, follow InfoWorld.com on Twitter.