Delivering fast and reliable access to enterprise storage across a wide area network can be a huge challenge. Here are some common strategies and pitfalls.

Managing corporate data is hard enough if your organization has only one site. Throw in a multitude of remote sites spread across the country, and you're talking about a real challenge. As our data grows at unprecedented rates, higher costs for new and upgraded WAN circuits seem inevitable. But take heart; there is hope. All you have to do is avoid what most people do wrong as they manage distributed environments and wrap your head around the right approach.

The status quo

Today, in most multisite scenarios, each site has its own storage pool, whether it's NAS/SAN or a traditional file server, and maintains its own data, sharing with other sites as necessary. Though not particularly storage-efficient, with proper management this can work well if each site represents a distinct business unit with its own data. But few of us are lucky enough to have such hard lines of demarcation. Most times, significant portions of data must be shared across sites.

In the conventional model, there are a few ways to deal with this. One is to define a home location for the data in question and have all of the other sites access this storage pool over the WAN as needed. This generally results in lots of WAN bandwidth being consumed by the same files moving from servers at one site to clients at another.

Another option is to replicate copies of the data at all remote sites so that those copies can be accessed locally, but this invites the nightmare scenario where users at more than one site modify the same files. Clearly, there must be a better way.
Bringing the users to the data

Perhaps the easiest way to deal with storage in a distributed multisite network is to keep all of the storage resources at a single site and bring the remote users to it through the use of server-based computing (SBC). Between Microsoft Terminal Services, Citrix XenApp, and the various available VDI implementations, there is a remote computing option that will fit nearly any workload. The primary benefit of this approach is that it replaces short-lived, high-bandwidth file transfers with relatively constant, low-bandwidth user sessions. Moreover, expenditures on data center infrastructure such as storage and backup hardware can be centralized to a single location, which can create massive capital and operational efficiencies.

This approach is not without its drawbacks. The most critical issue is that SBC places a significant emphasis on the reliability of the WAN circuits used by remote sites. If that circuit goes down, the remote site doesn't just lose access to corporate data housed at the headquarters site; it is dead in the water until connectivity is reestablished. There are many excellent ways of providing highly reliable, redundant connectivity for remote sites, but depending upon the locations of your remote sites, this may not always be a viable option.

Other SBC drawbacks include those presented by any need to handle technologies such as high-bit-rate audio/video, high-end imaging, and 3-D modeling. Technologies that make these applications work well in SBC environments exist and are maturing, but they haven't developed to the point that the "full desktop experience" can be delivered to any type of user without consuming significant amounts of bandwidth.

Distributed file systems

Distributed file systems are constructed to present a single file and folder hierarchy in multiple physical locations at the same time.
The most robust of these solutions are commonly used in grid computing environments, where very large numbers of geographically diverse computing nodes must share a common storage environment simultaneously. But most of these options target specialized data.

The challenges in engineering a system like this are significant. Most importantly, the system has to be able to maintain file locks across the entire organization, regardless of location. That is, if you are editing a document in Chicago, the storage resources presenting that same storage hierarchy in Los Angeles have to know to prevent users there from modifying the file at the same time you are. This is the most common downfall of do-it-yourself attempts at site-to-site file replication for use in collaborative environments. In addition, the system should be able to move storage to where it is used most frequently to decrease intersite replication loads. Blindly replicating huge amounts of data to remote sites that won't use it burns the very bandwidth you're trying to save, as well as duplicating significant storage hardware costs.

A very common yet incomplete example of this kind of system can be found in Microsoft's Distributed File System (DFS) in Windows Server 2003 R2 and Server 2008. This version of DFS builds on previous versions found in Server 2000 and 2003 in that it offers native file replication as well as a unified, transparent file hierarchy. With DFS, you can present an entire organization's file sharing tree as a single hierarchy, regardless of where the data is actually stored. You can also configure sections of that storage to be replicated to the sites that are likely to use them. However, one thing Microsoft's DFS doesn't do is file locking, which leads to the same concurrent access issues found in generic file replication. DFS replication can be a great answer for data that is only used by a single user at a time or for read-only data, but not for collaborative data.
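To see why organization-wide lock coordination is the hard part, here is a minimal Python sketch of the kind of advisory lock table a distributed file system must maintain across sites. The `LockRegistry` class and its methods are hypothetical and exist only to illustrate the behavior; a real system must also replicate this lock state between sites reliably and quickly, which is precisely what do-it-yourself replication schemes lack.

```python
class LockRegistry:
    """Toy organization-wide advisory lock table (illustrative only).

    In a real distributed file system, this state would be replicated
    or coordinated across every site presenting the hierarchy.
    """

    def __init__(self):
        self._locks = {}  # file path -> site currently holding the lock

    def acquire(self, path, site):
        """Grant an exclusive write lock unless another site holds it."""
        holder = self._locks.get(path)
        if holder is None or holder == site:
            self._locks[path] = site
            return True
        return False  # users at another site are editing this file

    def release(self, path, site):
        """Release the lock, but only if this site actually holds it."""
        if self._locks.get(path) == site:
            del self._locks[path]
```

In the Chicago/Los Angeles example from the text, the Chicago edit acquires the lock first, so the Los Angeles acquire attempt is refused until Chicago releases it. Without this coordination, both sites' replicas accept conflicting writes.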
WAN acceleration

A broad range of WAN acceleration devices has arrived on the market within the past few years. These options include products such as Cisco's WAAS, Citrix's Branch Repeater (previously WANScaler), and Riverbed's Steelhead. These devices and many others like them all have different strengths and weaknesses that make them better for some tasks and worse for others, but they share two basic concepts.

The first is network protocol acceleration. This is an area that has been inhabited by many a snake oil salesman in the past and has gotten a bad name in some circles. Done correctly, however, there are many optimizations that can be made to various types of network traffic to allow the available bandwidth to be used more efficiently. The goal is usually to tweak the basics of the IP traffic crossing a WAN so that it moves more smoothly and causes less congestion. When this is done properly, circuits can be driven to nearly 100 percent utilization without introducing congestion.

The second major characteristic is network deduplication, or caching. When implemented correctly, this feature set can have an enormous impact on the effectiveness of your WAN bandwidth, sometimes appearing to magnify it many times over. Essentially, network deduplication functions between a minimum of two WAN acceleration devices: one at the head-end site and one at the remote site. When data (a file, for example) is requested from the head-end site and sent to the remote site, the head-end accelerator makes a note of what that traffic flow looked like. At the same time, the accelerator at the remote site caches that data on its own disk. The next time that same file (or a file with significantly identical portions) is requested, the data is pulled from the cache on the remote-side WAN accelerator and sent directly to the client without being resent over the WAN.
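The deduplication mechanism described above can be sketched in a few lines of Python. This is a toy model under simplifying assumptions: the class names are invented, the chunking is fixed-size, and the cache lives in memory, whereas commercial accelerators use content-defined chunking and persistent disk caches. It does show the essential exchange, though: the head end sends raw chunks only once, then sends short hash references for any chunk the remote peer has already cached.

```python
import hashlib

CHUNK_SIZE = 4096  # fixed-size chunking for simplicity


def chunk_hashes(data):
    """Split data into chunks and pair each with its SHA-256 digest."""
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    return [(hashlib.sha256(c).hexdigest(), c) for c in chunks]


class HeadEndAccelerator:
    """Head-end device: remembers which chunks the remote peer has cached."""

    def __init__(self):
        self.peer_has = set()

    def send(self, data):
        """Encode data as a stream of raw chunks and cache references."""
        stream = []
        for digest, chunk in chunk_hashes(data):
            if digest in self.peer_has:
                stream.append(("ref", digest))          # tiny reference only
            else:
                self.peer_has.add(digest)
                stream.append(("data", digest, chunk))  # full payload once
        return stream


class RemoteAccelerator:
    """Remote-site device: caches chunks and rebuilds files locally."""

    def __init__(self):
        self.cache = {}

    def receive(self, stream):
        """Reassemble the original bytes, pulling references from cache."""
        parts = []
        for item in stream:
            if item[0] == "data":
                _, digest, chunk = item
                self.cache[digest] = chunk
                parts.append(chunk)
            else:
                parts.append(self.cache[item[1]])
        return b"".join(parts)
```

The second request for the same file crosses the WAN as nothing but a list of digests, which is why a well-populated cache can appear to magnify the circuit's capacity many times over for repetitive traffic.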
This can massively accelerate many different types of traffic, including file-sharing data, e-mail data, and SAN-to-SAN replication.

Combined with the single-hierarchy presentation available in Microsoft's DFS, a caching WAN accelerator can be an extremely effective solution for WAN bandwidth creep. Though the best of these devices can be quite expensive, when weighed against recurring WAN bandwidth costs, they are usually a very good investment and well worth investigating.

The bottom line for distributed data

No silver bullet can solve all the challenges of managing storage in a distributed enterprise network. Yet storage centralization, server-based computing, and WAN acceleration all offer good options that can improve the efficiency of your storage hardware investments and slash your recurring WAN bandwidth costs.

This story, "How to wrangle distributed data," was originally published at InfoWorld.com. Follow the latest developments in data management at InfoWorld.com.