By Matt Prigge, Contributing Editor

Wrangling your unstructured data

analysis
Jan 5, 2010

Quick and easy tips to gain control over that sprawl of files that won't fit in a database

Unstructured data is one of the biggest contributors to the data explosion. Defined as just about any kind of data that lacks a strict data model — essentially, any data that isn’t in a database of some kind — unstructured data includes log files, documents, audio files, and images. This kind of data is difficult to manage because of the wide range of formats it takes and the lack of standardized metadata attached to it.

Here are some quick tips that will help you monitor and control how this data is created and stored in your environment.

Put someone in charge

Methods used to manage this data usually involve one of two different strategies: Either draw the data into a database where it can be easily mined, archived, and eventually discarded, or try to apply an organizational structure to the way that mixed data is stored. The former is often used with data that has a somewhat consistent format, such as log files. The latter is often the only avenue open to generalized file-sharing data short of a comprehensive document management system.

The first thing you can do is to make someone responsible for every piece of data your organization generates and maintains. You should never allow data to be created on your network without knowing precisely who is responsible for it. For example, never, ever make a “public” file share that anyone can dump data into. These shares are disasters because no one can claim ownership of anything, making it almost impossible to determine what should be archived and what should be deleted. Worse still, the loose permissions structure required for a public share is almost a guarantee that privileged data will eventually be exposed to users who should not see it.

Most file-sharing data falls into three main types: user data, departmental data, and working group data. User data is fairly straightforward to deal with; any overruns in storage quotas are easy to link to a single individual and can be dealt with accordingly. But with departmental data or working group data (the latter may involve contributors across the company), you must always make a single individual responsible for the data in that group.

Apply storage quotas

Implementing quotas is easier said than done. What user likes limits that restrict the ability to create and store data? Even if you aren’t able to enforce quotas completely, though, you will at least be able to observe why the data is growing and make better planning choices as a result. Depending on what you use to store your data, you may already have built-in quota tools.

Calculate that cost per gigabyte

Set a cost for storing data and make clear to management and users what that cost is. This is perhaps the most effective way of controlling data growth. While many users might not have a good concept of what 500GB is, they will understand a dollars-per-gigabyte cost, especially when multiplied across each owner’s data.

When you figure that cost, don’t stop with the storage hardware. Include a slice of everything and anything that allows that data to be accessed, stored, protected, monitored, and managed. That includes the fractional cost of networking hardware, snapshot and replication space, file servers, backup software, backup tapes, management software, and administration time. The resulting number will be a much closer approximation of what your storage really costs — and if you’ve never calculated it before, it will probably surprise you.
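The arithmetic itself is trivial once you've gathered the line items. This sketch uses purely illustrative dollar figures — plug in your own annualized costs and usable capacity:

```python
# A fully loaded cost-per-gigabyte calculation. Every figure below is an
# illustrative assumption, not a real price.
annual_costs = {
    "storage_hardware": 40_000,       # arrays and disks, amortized per year
    "network_gear": 6_000,            # fractional share of switches and HBAs
    "snapshot_replication": 8_000,    # extra capacity consumed by copies
    "file_servers": 10_000,
    "backup_software_and_tapes": 12_000,
    "management_software": 5_000,
    "admin_time": 15_000,             # fraction of salaries spent on storage
}

usable_capacity_gb = 10_000  # 10TB usable

cost_per_gb = sum(annual_costs.values()) / usable_capacity_gb
print(f"Fully loaded cost: ${cost_per_gb:.2f} per GB per year")
```

With these made-up numbers the answer is $9.60 per GB per year — several times the raw price of the disk, which is exactly the point you want management and users to absorb.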

Keeping a watchful eye

After you’ve done all that work, don’t let it wither on the vine. Keep monitoring the growth of your data. As you scope out new storage resources, that information may point toward new solutions — automated data archiving and deduplication systems, for example.

If you don’t have some kind of data-monitoring tool already in place, you may want to check out Aprigo’s NINJA, an extremely easy-to-use SaaS application that will scan and report on a large unstructured data set. It’s currently in an open beta stage that limits scans to 500GB, but yields surprisingly useful results with almost no installation or configuration effort. The version I looked at made it easy to apply a storage cost and report on it, as well as compare results to the aggregated results of other users. If you’re starting from scratch, this is probably the easiest way to gather this information without spending a lot of time doing it.

Forewarned is forearmed

Take these initial steps, and you’ll be in a much better position to plan out more advanced data management techniques. The more information you have on the data you store, the less likely you’ll be to find yourself with no space for growth. The last thing anyone wants to do in this economic climate is take a long walk down the hall to make an unbudgeted request for new storage hardware.

This story, “Wrangling your unstructured data,” was originally published at InfoWorld.com. Follow the latest developments in data management at InfoWorld.com.