James Kobielus
Contributor

When big data is truly better

analysis
Sep 4, 2014 | 6 mins

Take advantage of scale when past experience indicates greater analytic value will result. But big data is not a hammer -- nor is every problem a nail


Many people assume that with big data, bigger is always better. They tend to approach the “bigger is better” question from various philosophical perspectives, which I characterize as follows:


  • Faith: This is the notion that, somehow, greater volumes, velocities, and/or varieties of data will always deliver fresher insights, which amounts to the core value of big data analytics. If we’re unable to find those insights, according to this perspective, it’s only because we’re not trying hard enough, we’re not smart enough, or we’re not using the right tools and approaches.
  • Fetish: This is the notion that the sheer bigness of data is a value in its own right, regardless of whether we’re deriving any specific insights from it. If we’re evaluating the utility of big data solely on the specific business applications it supports, according to this outlook, we’re not in tune with the modern need of data scientists to store data indiscriminately in data lakes to support future explorations.
  • Burden: This is the notion that the bigness of data is not necessarily better or worse, but it is simply a fact of life that has the unfortunate consequence of straining the storage and processing capacity of existing databases, thereby necessitating new platforms (such as Hadoop). If we’re not able to keep up with all this burdensome new data, or so this perspective leads us to believe, the core business imperative is to change over to a new type of database.
  • Opportunity: This is, in my opinion, the right approach to big data. It’s focused on extracting unprecedented insights more effectively and efficiently as the data scales to new heights, streams in faster, and originates in an ever-growing range of sources and formats. It doesn’t treat big data as a faith or fetish, because it acknowledges that many differentiated insights can continue to be discovered at lower scales. It doesn’t treat data’s scale as a burden, either, but as simply a challenge to be addressed effectively through new database platforms, tooling, and practices.

Last year, I blogged on the hardcore use cases for big data in a discussion that was exclusively on the “opportunity” side of the equation. Later in the year, I observed that big data’s core “bigness” value derives from the ability of incremental content to reveal incremental context. More context is better than less when what you’re doing is analyzing data in order to ascertain its full significance. Likewise, more content is better than less when you’re trying to identify all of the variables, relationships, and patterns in your problem domain to a finer degree of granularity. The bottom line: More context plus more content usually equals more data.

Big data’s value is also in its ability to correct errors that are more likely to crop up at smaller scale. In that same post, I cited a third party who observed that, for a data scientist, having less data in the training set brings several modeling risks. For starters, at smaller scales you’re more likely to overlook key predictive variables. You’re also more likely to skew the model toward nonrepresentative samples. In addition, you’re more likely to find spurious correlations that would disappear if you had a more complete data set revealing the underlying relationships at work.
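To make that last risk concrete, here’s a minimal sketch in Python (my own illustration, not from the cited post): it correlates dozens of purely random features against an equally random target, then reports the strongest “discovered” correlation at several sample sizes.

```python
# Minimal sketch: spurious correlations shrink as the sample grows.
import numpy as np

rng = np.random.default_rng(42)

def max_spurious_corr(n_rows, n_features=50):
    """Correlate many independent random features against an equally
    random target and return the strongest (entirely spurious) hit."""
    X = rng.normal(size=(n_rows, n_features))
    y = rng.normal(size=n_rows)
    return max(abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_features))

for n in (20, 200, 20_000):
    print(f"n={n:>6}: strongest spurious correlation = {max_spurious_corr(n):.3f}")

# At n=20 the best "correlation" can look impressive (often above 0.5);
# at n=20,000 it collapses toward zero, because it was never real.
```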

Scale can be beautiful

Everybody recognizes that some types of data and some use cases are more conducive than others to realizing fresh insights at scale.

In that vein, I recently came across a great article that spells out one specific category of data — sparse, fine-grained behavioral data — on which predictive performance often improves with scale. The authors, Junqué de Fortuny, Martens, and Provost, state that “a key aspect of such datasets is that they are sparse: For any given instance, the vast majority of the features have a value of zero or ‘not present.’”

What’s most noteworthy about this (and the authors support their discussion by citing ample research) is that this type of data is at the heart of many big data applications with a customer-analytics focus. Social media behavioral data fits this description, as do Web browsing behavioral data, mobile behavioral data, advertising response behavioral data, natural language behavioral data, and so on.
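As a hypothetical illustration of what “sparse” means here (the names and sizes below are invented), the following Python sketch builds a user-by-behavior matrix in which each user exhibits only about 20 of 50,000 possible behaviors, so a compressed sparse format needs to store only the nonzero cells.

```python
# Hypothetical sparse behavioral data: each row is a user, each column
# one specific behavior (a page visited, an item purchased, and so on).
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
n_users, n_behaviors = 100_000, 50_000

# Simulate roughly 20 observed behaviors per user out of 50,000 possible.
rows = np.repeat(np.arange(n_users), 20)
cols = rng.integers(0, n_behaviors, size=n_users * 20)
vals = np.ones(len(rows), dtype=np.int8)

X = csr_matrix((vals, (rows, cols)), shape=(n_users, n_behaviors))
density = X.nnz / (n_users * n_behaviors)
print(f"{density:.4%} of cells are nonzero")  # ~0.04%: "not present" dominates
```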

“Indeed,” the authors state, “for many of the most common business applications of predictive analytics, such as targeted marketing in banking and telecommunications, credit scoring, and attrition management, the data used for predictive analytics are very similar … [T]he features tend to be demographic, geographic, and psychographic characteristics of individuals, as well as statistics summarizing particular behaviors, such as their prior purchase behavior with the firm.”

The core reason why bigger behavioral data sets are usually better is simple, the authors state: “Certain telling behaviors may not be observed in sufficient numbers without massive data.” That’s because, in a sparse data set, no individual person whose behavior is being recorded is likely to exhibit more than a limited range of behaviors. But when you look across an entire population, you’re likely to observe every specific type of behavior being expressed at least once and perhaps numerous times within specific niches. At smaller data scales, looking at fewer subjects and observing fewer behavioral features, you’re likely to overlook much of this richness.
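The arithmetic behind that claim is simple. As a back-of-envelope sketch (the rate below is invented purely for illustration), suppose a telling behavior shows up in one of every 50,000 users:

```python
# Hypothetical rate: one telling behavior per 50,000 users observed.
for n_users in (10_000, 1_000_000, 100_000_000):
    expected = n_users / 50_000
    print(f"{n_users:>11,} users -> ~{expected:,.0f} expected observations")

# 10,000 users yield essentially none; 100 million yield about 2,000,
# enough instances for a model to pick up the pattern.
```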

Predictive models thrive on the richness of the source behavioral data sets, which drives more accurate predictions across a wider range of future scenarios. Hence, bigger usually is better.

When bigger equals fuzzier

Nonetheless, the authors also note scenarios where this assumption falls apart, and it all has to do with the predictive value of specific behavioral features. Essentially, a trade-off underlies predictive behavioral modeling.

Each incremental behavioral feature added to a predictive model should be relevant enough to the prediction being made that it boosts the model’s learning yield and predictive power sufficiently to overcome the ever-wider variances (and hence over-fitting and predictive error) that tend to come with ever-larger feature sets. As the authors state: “The large number of irrelevant features simply increases variance and the opportunity to over-fit, without the balancing opportunity of learning better models (presuming that one can actually select the right subset).”
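To see that trade-off in miniature, here is a hedged sketch (my own illustration, assuming scikit-learn is available): a logistic regression is fit on a small sample twice, once on five genuinely predictive features and once with 500 irrelevant noise columns appended.

```python
# Sketch: irrelevant features add variance and invite over-fitting.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 300
signal = rng.normal(size=(n, 5))    # 5 genuinely predictive features
noise = rng.normal(size=(n, 500))   # 500 irrelevant features
y = (signal.sum(axis=1) + rng.normal(scale=0.5, size=n) > 0).astype(int)

for name, X in (("signal only", signal),
                ("signal + noise", np.hstack([signal, noise]))):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
    print(f"{name:>14}: train={model.score(X_tr, y_tr):.2f}, "
          f"test={model.score(X_te, y_te):.2f}")

# Typically: near-perfect training accuracy but noticeably worse test
# accuracy once the 500 noise columns are added: variance without yield.
```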

Clearly, bigger isn’t better when bigness gets in the way of deriving predictive insights. You don’t want your big data analytics effort to be a victim of its own scale. Your data scientists have to be smart enough to know when to scale back their models to the core set of features best suited to the analytic task at hand.
