Harald Smith
Contributor

Shopping for data: what’s fit for your purpose?

opinion
Feb 7, 2018, 5 mins

The data catalog (or data marketplace) makes finding and accessing data easier, another step toward data democratization

There’s been a shift in data strategy from defensive to offensive. Historically, data governance focused on compliance and security. It still does, but it’s expanding to address data accessibility: getting data to the people who need it to solve business problems, drive new revenue, create value, and even monetize the data.

But how do you accomplish this? In the past, I’d receive a report with data deemed relevant to my tasks, and if I needed more I’d ask someone who had the tools to get me the data. But the dramatic increase in data volume (much of it produced by automated devices) renders that method obsolete. Enter the data catalog.

Great expectations

With the rise of ecommerce, we’ve gone from browsing physical stores to browsing online catalogs. On Amazon, by setting a few simple search criteria, I can find everything from the novels of Murakami to All-Clad cookware. I can see what’s in stock and from whom. I can see how others rated the item, and what else they’ve purchased. This rich user experience has permanently changed how we shop.

It’s not surprising, then, that we expect that same experience when searching for data. Access to and understanding of data has become indispensable; jobs across all industries demand insights from data in some form. In my last post I addressed the importance of data democratization across organizations and in society at large. That need is why what began as the data dictionary and evolved into the metadata repository is no longer adequate. Whether you call it the data catalog or the data marketplace, people now want an online location to “shop for data,” and these tools must provide certain characteristics to be satisfactory.

Key requirements

Maybe you’re considering a data catalog for your organization and are receiving emails about data marketplace products or a “data bazaar.” Are any of them worthwhile? How do you choose?

With some variation, these products all come down to how to find and access data. The first thing to consider is semantics. I’m not going to know all the permutations that developers used to name the various attributes of the data. The search terms must be recognizable business terms that connect me to relevant data, which is what business glossaries attempt to provide.

But even using the proper terms, this may be no better than a search for rock music of the 1990s on Amazon—too broad to be useful. When I shop for music, I already have certain requirements in mind beyond the genre or the artist. We have requirements for data as well, though often we don’t stop to articulate what those are. They amount to data quality requirements and articulate what is fit for purpose. Incorporating this level of insight is critical to support the data governance mission.

That’s why traditional metadata content such as data lineage and data profiling is still needed to provide insight for selection, to understand the data source, its origin, and its relationship to other data sources. But assuming we have that context, is it enough? It may be nice to know that a given source received a four-star rating, but there is no guarantee that what someone else said is of any relevance to my work.

So how do you assess the available data? You need information, and not just metadata: you need context around the data so you can not only search and filter it, but assess its fitness for your purpose.

Questions … and answers

A data catalog must allow the user to ask key questions and use those to filter results. For instance:

  • Is the data complete? Not just the fields: is the whole dataset comprehensive in relation to the business problem? For example, if I’m looking at the impact of weather on store inventory, I may need a dataset with weather data for all delivery routes, not just my stores.
  • Is the data consistent and valid—not as determined by the creator, but by the user?
  • Is the data understandable? If there are codes, can the user understand them?
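To make this concrete, here is a minimal sketch in Python of what applying such questions as filters could look like. It assumes a hypothetical catalog whose entries record which delivery regions a dataset covers, its share of missing values, and whether its codes are documented. The class, field names, thresholds, and sample entries are all illustrative assumptions, not the API of any real catalog product.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class CatalogEntry:
    """Hypothetical catalog record; field names are assumptions for illustration."""
    name: str
    regions_covered: Set[str]   # which delivery regions the dataset includes
    null_rate: float            # fraction of missing values across fields
    documented_codes: bool      # True if coded values come with a lookup


# A "question" is a named yes/no check that the user, not the creator, defines.
Question = Callable[[CatalogEntry], bool]


def shortlist(entries: List[CatalogEntry], questions: Dict[str, Question]) -> List[str]:
    """Names of entries that satisfy every question."""
    return [e.name for e in entries if all(q(e) for q in questions.values())]


def gaps(entries: List[CatalogEntry], questions: Dict[str, Question]) -> Dict[str, List[str]]:
    """For each entry, the questions it fails (empty list means fit for purpose)."""
    return {
        e.name: [label for label, q in questions.items() if not q(e)]
        for e in entries
    }


# Requirements for the weather-and-inventory problem (assumed values).
needed_regions = {"north", "south", "east"}
questions: Dict[str, Question] = {
    "complete": lambda e: needed_regions <= e.regions_covered,
    "valid": lambda e: e.null_rate <= 0.05,
    "understandable": lambda e: e.documented_codes,
}

# Made-up sample entries, purely for illustration.
entries = [
    CatalogEntry("weather_stores_only", {"north"}, 0.01, True),
    CatalogEntry("weather_all_routes", {"north", "south", "east", "west"}, 0.02, True),
    CatalogEntry("weather_raw_feed", {"north", "south", "east", "west"}, 0.30, False),
]

print(shortlist(entries, questions))
print(gaps(entries, questions))
```

The point of returning the failed questions, rather than a single score, is that the same dataset can be a poor fit for one business problem and a good fit for another; a different user would simply supply different questions.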

In a 2017 survey, some 51 percent of data scientists reported access to quality data as the No. 1 barrier to their work. When browsing the data catalog, being able to articulate such questions, and see how the available data satisfies them, is critical if we are to work through hundreds or thousands of data sources. It’s not the rating of data, but the context of how well the data fits differing requirements, that allows us to gauge whether it is useful or not.

The data catalog addresses the first barrier to data democratization: finding and accessing data. A familiar, consumer-style search capability is foundational, but the ability to apply questions to the data for a given business problem is central to reducing the time required to wade through the range of data sets and to getting quickly to those of highest value and interest. If you’re exploring a data catalog solution, ensure that it not only captures metadata but also provides business semantics, context, and a means to evaluate the data content against your requirements.

Harald Smith is Director of Product Management at Trillium Software, now part of Syncsort, and co-author of Patterns of Information Management, published by IBM Press.

Prior to joining Trillium Software, Harald worked at IBM for nearly a decade, including as a software architect. Harald has a diverse career specializing in information quality, integration, and governance products with a focus on accelerating customer value and delivering innovative solutions. With experience in IT, consulting, and product and project management across multiple industries, Harald blends together business, user, and technical perspectives in the effective understanding and use of data to solve business challenges.

He has written extensively on the integration, management, use and governance of information, including two IBM Redbooks, and was recognized as a Contributing Author to developerWorks. Harald has been issued four patents in the field of data management and integration.

The opinions expressed in this blog are those of Harald Smith and do not necessarily represent those of IDG Communications, Inc., its parent, subsidiary or affiliated companies.
