Matt Prigge
Contributing Editor

What storage virtualization really means

analysis
Sep 27, 2010 | 7 mins

Get a crash course in storage virtualization and how its various flavors can help you corral your storage infrastructure

If there’s one thing that marketing departments in tech companies around the world are unbelievably good at, it’s taking the newest and hottest technology terms and diluting them to the point that it’s nearly impossible to really know what anyone is talking about anymore. Storage virtualization provides a supreme example.

These days, storage virtualization can refer to a huge swath of products and technologies, from the simplest file systems all the way up to cutting-edge storage abstraction layers that are capable of managing petabytes of heterogeneous storage spread across the world under a single coherent management framework.

Like all forms of virtualization, storage virtualization is all about abstraction. By inserting a layer of abstraction between the storage consumer and the physical storage, you can perform a wide range of storage tricks. Taking consistent snapshots, replicating data to a redundant data center, transparently migrating data from one SAN to another, transparent backup, and zero-downtime storage failover are all examples of the incredibly handy features made possible by various kinds of storage virtualization.

Before you even think about selecting a variety of storage virtualization to try to solve a problem, you need a crash course in what differentiates the various flavors.

Your basic definition of storage virtualization

In the most technical sense of the term, anything that abstracts the physical storage from the operating system that addresses it could be considered a form of storage virtualization. A simple example can be found in the file system that your hard disk is formatted with. When you open any file on your computer, the operating system performs a lookup in your file system’s file allocation table that resolves to the logical blocks on your hard drive that make up the file. That table is essentially a database containing the necessary metadata to link a given file name that you might recognize to the actual location of the file’s data on the physical media that the hardware will recognize.
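To make the lookup concrete, here is a minimal toy model of that name-to-blocks indirection. The disk layout, file name, and table structure are all invented for illustration and don't correspond to any real file system's on-disk format:

```python
# The "physical media": a flat list of fixed-size blocks.
disk = ["" for _ in range(16)]
disk[4], disk[5], disk[9] = "Dear ", "diary", "..."

# The allocation table: file name -> ordered list of block addresses.
# Note the blocks holding one file need not be contiguous.
allocation_table = {"diary.txt": [4, 5, 9]}

def read_file(name):
    """Resolve a file name to its blocks and reassemble the contents."""
    blocks = allocation_table[name]          # metadata lookup
    return "".join(disk[b] for b in blocks)  # fetch from physical media

print(read_file("diary.txt"))  # -> "Dear diary..."
```

The caller only ever sees the name `diary.txt`; where the bytes physically live is entirely the table's business, which is the essence of the abstraction the article describes.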

That may seem like a simplistic place to start when talking about such a complex topic, but that same model plays out no matter what level of storage virtualization you may be talking about. It’s just an issue of scope and complexity.

Virtualized arrays

The traditional SANs of yesteryear are not a whole lot more than very high-performance, network-attached RAID arrays with additional features stapled onto them. Instead of being internal to and accessible by a single server, they’re externally located and many servers can share them. However, individual groups of physical disks are still statically formed into RAID arrays that make up a given volume.

If you want to make changes on a traditional SAN — such as expanding a volume — you need to assign new physical disk resources to the volume, and the RAID array needs to be extended across the new disks (often a time-consuming process involving significant data reshuffling on physical disks). Similarly, snapshots are usually implemented by dedicating a second volume to hold snapshot data — a process that can result in a significant write performance hit when snapshots are present.

Virtualized arrays overcome these and many other challenges by layering an additional level of abstraction around the RAID that provides the physical disk redundancy. Instead of dedicating specific disks to an array, many disks are pooled together and each virtual disk is divided into blocks sprinkled across all of them. Though implementations vary from vendor to vendor, the array will typically utilize an internal pointer table to relate the storage block address of a virtual disk as it is presented to the server to the actual physical location on the array’s RAID volumes.

This abstraction allows these arrays a great deal of freedom to handle complex operations such as volume expansions, snapshots, and replication. Instead of requiring the designation of a new physical disk to be a member of a given volume and needing to wait for the data to be redistributed, the SAN merely updates its pointer table with the addresses of as-yet unallocated blocks on the physical disks, usually an operation that can be completed instantaneously. Many virtualized SANs also allow you to thin provision virtual disks, a simple matter of waiting to allocate physical blocks to a virtual disk until they’re actually written.
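The pointer-table mechanics described in the last two paragraphs can be sketched in a few lines. This is a deliberately simplified model, with invented names, of how a shared pool plus a per-volume pointer table yields instantaneous expansion and thin provisioning:

```python
physical = [None] * 1000   # block store spanning the array's RAID sets
free = list(range(1000))   # physical addresses not yet allocated

class VirtualDisk:
    def __init__(self, size):
        self.size = size   # advertised (logical) size in blocks
        self.table = {}    # pointer table: virtual block -> physical block

    def write(self, vblock, data):
        assert vblock < self.size
        # Thin provisioning: a physical block is claimed only on first write.
        if vblock not in self.table:
            self.table[vblock] = free.pop()
        physical[self.table[vblock]] = data

    def read(self, vblock):
        addr = self.table.get(vblock)
        return physical[addr] if addr is not None else None

    def expand(self, new_size):
        # "Expansion" just raises the advertised size; no data moves.
        self.size = new_size

vd = VirtualDisk(size=100)
vd.write(0, "hello")
vd.expand(200)         # instantaneous: only metadata changes
vd.write(150, "world")
print(len(free))       # -> 998: only two physical blocks consumed
```

Notice that the 200-block virtual disk has consumed exactly two physical blocks, and that growing it touched no data at all, which is why these operations complete almost instantly on virtualized arrays.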

Snapshots are handled similarly; instead of needing to designate a separate volume to hold snapshot data, the array simply writes new incoming data into unallocated physical disk blocks and marks the physical blocks they would have replaced as members of the snapshot instead of the main virtual disk. Again, this requires no data to be moved on physical disk, which usually results in very little if any performance overhead.
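That snapshot scheme (often called redirect-on-write) amounts to freezing a copy of the pointer table and steering new writes elsewhere. A minimal sketch, with all structures invented for illustration:

```python
block_store = {0: "v1"}  # physical block address -> data
unallocated = [1, 2, 3]  # free physical addresses
live = {0: 0}            # live volume's pointer table: virtual -> physical
snapshots = []           # each snapshot is a frozen pointer table

def take_snapshot():
    snapshots.append(dict(live))  # metadata copy only; no data moves

def write(vblock, data):
    new_addr = unallocated.pop()  # redirect the write to a fresh block
    block_store[new_addr] = data
    live[vblock] = new_addr       # old block stays put, owned by the snapshot

take_snapshot()
write(0, "v2")
print(block_store[live[0]])          # -> "v2" (current data)
print(block_store[snapshots[0][0]])  # -> "v1" (snapshot's view)
```

The overwritten block is never copied or relocated; it simply stops being referenced by the live table, which is why snapshot creation is near-free on these arrays.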

Some virtualized storage arrays even allow you to virtualize storage from other storage devices, essentially allowing them to be managed under the same umbrella. However, support for third-party devices can sometimes be very limited, and there may be a palpable performance hit involved when this capability is leveraged.

Network-based storage virtualization

If you think the abstraction offered by virtualized arrays gives storage administrators a lot of flexibility, you haven’t seen anything yet. Instead of simply abstracting a virtual disk from the physical disks present within an array, network-based storage virtualization abstracts a virtual disk from the actual device it resides on — providing a single management interface for an entire organization’s storage devices (so long as the virtualization device supports it).

This type of storage virtualization can operate in two distinctly different ways: in band and out of band. In-band devices are generally implemented within the storage switches or appliances that sit in between hosts and the storage devices. These devices maintain the necessary metadata to map virtual disk blocks to the physical devices they actually reside on and actively proxy I/O operations between the two. This can add a small amount of latency as an additional “hop” is inserted between the host and the physical storage it is attempting to access. However, in-band storage appliances often implement their own storage cache to counterbalance this effect.

Out-of-band storage virtualization generally involves the use of a metadata server combined with software loaded onto each host that needs to access virtualized storage. Before the host attempts to access storage, it first performs a lookup against the metadata server to learn where the storage is located. Then the host is free to talk directly to the storage device without an intermediary. Though this approach avoids much of the latency that can be introduced by in-band implementations, it also prevents additional caching beyond what the physical storage device might already implement.
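A toy contrast of the two data paths may help. Out of band, the host consults a metadata service and then reads the device directly; in band, every I/O flows through the appliance, which gets the chance to cache. Device names and layouts here are invented for illustration:

```python
devices = {
    "san-a": {0: "alpha"},
    "san-b": {7: "beta"},
}
# Metadata service: virtual block -> (device, physical block)
metadata = {0: ("san-a", 0), 1: ("san-b", 7)}

def read_out_of_band(vblock):
    device, pblock = metadata[vblock]  # 1. lookup against metadata server
    return devices[device][pblock]     # 2. host reads the device directly

appliance_cache = {}
def read_in_band(vblock):
    if vblock in appliance_cache:      # appliance-side cache hit
        return appliance_cache[vblock]
    device, pblock = metadata[vblock]  # appliance proxies the I/O itself
    appliance_cache[vblock] = devices[device][pblock]
    return appliance_cache[vblock]

print(read_out_of_band(1))  # -> "beta"
print(read_in_band(1))      # -> "beta", now cached on the appliance
```

The trade-off in the article falls straight out of the code: the out-of-band path has no intermediary on the data path but nowhere to cache, while the in-band path adds a hop but can absorb repeat reads.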

By placing this abstraction layer in front of multiple storage devices, network-based storage devices give you complete freedom to move the data making up a virtual disk from one SAN to another without the server having any idea that the move is taking place. Likewise, they also open the door to not only cross-vendor storage replication — generally an impossibility due to the proprietary nature of array-based replication — but also transparent storage failover. This kind of functionality can prove to be invaluable in extremely large storage environments where a wide range of storage devices are in use and downtime is not an option.
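Transparent migration between SANs reduces to the same pointer trick, one level up: copy the blocks, then flip the metadata entry, while the host's virtual addresses never change. A hypothetical sketch:

```python
devices = {"old-san": {3: "payload"}, "new-san": {}}
metadata = {0: ("old-san", 3)}  # virtual block -> (device, physical block)

def migrate(vblock, target):
    src_dev, pblock = metadata[vblock]
    devices[target][pblock] = devices[src_dev][pblock]  # copy the data
    metadata[vblock] = (target, pblock)                 # flip the pointer
    del devices[src_dev][pblock]                        # reclaim the source

migrate(0, "new-san")
# The host still addresses virtual block 0; it now resolves to new-san,
# and the old SAN can be drained and retired without downtime.
```

Because the hosts only ever see virtual addresses, neither vendor lock-in on the source array nor the move itself is visible above the abstraction layer.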

This article, “What storage virtualization really means,” originally appeared at InfoWorld.com. Read more of Matt Prigge’s Information Overload blog and follow the latest developments in storage at InfoWorld.com.