As we plunge headlong into the cloud era, we need reliable, API-accessible data sources to power a new generation of applications.

You’ve probably heard the cliché a million times: Data is the most valuable asset of any business. Originally, that was meant to apply to a company’s financial, customer, and product data. But what about data outside the organization? Companies spend selectively, often at premium rates, for data directly relevant to their business: D&B for financial data, Experian for credit information, and so on. But as with software, the market for data has opened up, with much of it available free of charge.

As InfoWorld’s Paul Krill noted last month, developers are increasingly turning to API-accessible data sources for their applications. Last week InfoWorld posted “12 APIs every programmer should know about,” which includes everything from a feed of real-time flight delays to the definitive repository of the U.S. government’s social media accounts.

A number of aggregators pull together a wild mix of data sources. The Windows Azure Data Marketplace was an early mover and today offers 167 data sources, 82 of which are free. The Programmable Web, a decade-old Web directory recently bought by MuleSoft, lists thousands of APIs that return data, though many have fallen into disrepair. Several upstarts, such as the big data venture InfoChimps, aggregate thousands of data sets and APIs, although, again, many are out of date or no longer available.

The data-as-a-service game is tough. A startup called Factual launched in 2007 with ambitions of becoming a clearinghouse for a huge range of data, but narrowed its sights in 2010 to delivering high-quality, location-based data.
I recently interviewed Factual’s founder and CEO, Gil Elbaz, who also co-founded Applied Semantics, the developer of AdSense, bought by Google for $102 million in 2003. When I asked Elbaz about the technology behind Factual, it quickly emerged that most of it was devoted to ensuring data quality. He says, “You need to run cleaning algorithms against the raw data. The paradigm that we believe in is that you should always store and reprocess data from fundamentals. So we store the rawest form of data — all the data we collect, either from the Web or from our partners. Any time any of our cleaning algorithms improves even slightly, we don’t apply it to the database — we apply it to the underlying raw data sources, which is why we have such large storage requirements.”

In other words, although the Factual service offers a relatively modest 67 million listings of local businesses and points of interest around the world, it needs nearly a petabyte of HDFS storage to maintain the source data and cleanse it recursively.

“I don’t think there’s been enough emphasis on thinking deeply about what’s the best possible workflow for good data,” says Elbaz. “Data in itself is not factual until you’ve processed it with some sort of workflow that improves its clarity and provides more metadata.” But the problem is that the effort spent on ensuring data quality is not immediately apparent to the customer. “The unfortunate reality is that it’s really hard to build a brand in data. It would be nice to live in a world where the data would speak for itself and somebody could apply a seal of approval, but we really don’t live in that world today,” says Elbaz.

Today the obsession is with big data analysis of semi-structured data, which is highly useful for spotting trends but has nothing to do with accuracy at a granular level. Meanwhile, in the broader sphere of the Internet, made-up “facts” sit side by side at a peer level with the real thing.
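The reprocess-from-fundamentals paradigm Elbaz describes can be sketched in a few lines. This is a minimal illustration of the idea only; every name below is hypothetical and has nothing to do with Factual’s actual pipeline:

```python
# Sketch of "store raw, reprocess from fundamentals":
# raw records are kept immutably, and every improvement to a
# cleaning algorithm triggers a full rebuild of the derived
# database from the raw store, rather than patching it in place.

raw_store = []  # append-only store of records exactly as collected


def ingest(record):
    """Keep every record in its rawest collected form."""
    raw_store.append(record)


def clean_v1(record):
    """First-generation cleaner: trim stray whitespace."""
    return {k: v.strip() for k, v in record.items()}


def clean_v2(record):
    """Improved cleaner: also normalize business-name casing."""
    cleaned = clean_v1(record)
    cleaned["name"] = cleaned["name"].title()
    return cleaned


def rebuild(cleaner):
    """Regenerate the whole derived database from raw data
    whenever the cleaning algorithm improves."""
    return [cleaner(r) for r in raw_store]


ingest({"name": "  blue bottle coffee ", "city": "Oakland"})
ingest({"name": "ACME hardware", "city": " Boston "})

db = rebuild(clean_v1)  # first derived database
db = rebuild(clean_v2)  # better cleaner: full reprocess of the same raw data
```

The cost of this design is exactly what Elbaz notes: the raw store only ever grows, and every algorithmic improvement means reprocessing all of it, which is why a 67-million-listing service can need nearly a petabyte of storage.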
Even the quality of data exposed by worthy initiatives like Data.gov has been called into question.

There’s lots of talk about making business and government data available on the Internet, but not nearly as much conversation around the much more difficult problem of validating that data. Data provided as a service in the cloud needs to aspire to be as valuable as core data maintained by customers. But unfortunately, no independent agency exists to give a stamp of approval to the good stuff. Perhaps we simply need to wait for trusted brands to prove themselves in practice. It could be Elbaz is right when he says, “Everything is an opinion about a fact unless there’s some company behind it saying, ‘We have a strong feeling about this.’”

This article, “Why we need data we can believe in,” originally appeared at InfoWorld.com. Read more of Eric Knorr’s Modernizing IT blog.