At a low cost of entry, the emerging technologies that define the big data trend are already delivering value, so first consider the problems you need to solve -- then dive in.

Already, “big data” has become one of those buzzphrases you say with an apologetic smirk. It sounds like marketecture, broad enough to apply to almost anything. So let’s clear up what big data is and isn’t.

Perhaps you’ve heard the canonical “three V’s” definition: data high in volume, velocity, and variety. In other words, big data comes in multiterabyte quantities, accrues or changes fast, often resists normalized structure -- and tends to demand technologies beyond the tried-and-true RDBMS or data warehouse.

That cluster of new technologies around big data -- including Hadoop, a wild array of new NoSQL databases, massively parallel processing (MPP) analytic databases, and more -- together represents the biggest leap forward in data management and analytics since the 1980s. That’s really what big data is about. And these emerging technologies are already delivering business value: in deep insights about customer behavior, in faster app dev cycles, in the ability to use commodity hardware, and in reduced software licensing costs, because almost all these new technologies are open source.

Assuming your data volumes are exploding as fast as everyone else’s, you’re part of the big data trend whether you like it or not. So why not employ the tools purpose-built for the big data era? It’s a better strategy than blindly buying more Oracle licenses or building another gold-plated data warehouse. Where you start, though, depends on the problems you want to solve.

Problem No. 1: I don’t want to pay Oracle more money

This is not a big data problem per se, but software surrounding the big data trend may help solve it. Many companies simply use Oracle (or DB2 or SQL Server) as their default data store for almost everything. After all, the RDBMS is probably the most successful technology in the history of software, and if you want a battle-tested, unassailable RDBMS with all the bells and whistles, you choose Oracle (or other ironclad commercially licensed software) and pay a lot for it. That’s where data goes, period.

But now the RDBMS has all sorts of viable competition. As it turns out, there are many instances where database needs do not include relational capability, two-phase commits, complex transactions, and so on. In such cases, NoSQL solutions -- most of which are open source -- may perform and scale better at vastly reduced cost and with much lower maintenance overhead. For an overview of NoSQL database types, see “Which freaking database should I use?” by InfoWorld’s Andrew Oliver.

Now, nobody would power down their Oracle servers and port all their existing customer and product data to, say, MongoDB. For one thing, the security isn’t there yet -- and by their nature NoSQL databases tend to compromise ACID compliance. Also, when complex transactions are involved, even NoSQL vendors will tell you that an RDBMS remains your best solution. Finally, if you just want to save money, you’re not going to waste a fortune rearchitecting an Oracle database and its applications for NoSQL (an open source RDBMS such as PostgreSQL might be a better choice). But for new projects, especially those involving Web applications that demand instant scalability -- or analytics systems intended to crunch gobs of semistructured data -- exciting alternatives beckon. Not only are they mostly open source, they run on low-cost server hardware.

Problem No. 2: I can’t get what I want from BI

Business intelligence always seems to rank among the top few technology priorities for big companies. Yet year after year, few seem very happy with the results.

It all boils down to the questions you want to ask. If you have queries related to, say, the regional distribution of your transactions or trends in the costs of your materials -- or if you want to make some predictions about how all that may play out next year -- conventional business intelligence and analytics systems probably remain your best bet. But if you want to ask something like, “How are millions of customers using my Web applications, and how might I improve them?” you’re better served by a solution built around an engine to handle semistructured data, such as Hadoop. In fact, Hadoop is mentioned in the same breath as big data so often you’d think the two were interchangeable.

Hadoop was purpose-built to process very large quantities of semistructured data, such as the clickstream of events left by Web users. InfoWorld’s Andrew Lampitt has explored some great examples of this, including the use of Hadoop by Facebook, Experian, and Evernote. Hadoop has two core components: HDFS (Hadoop Distributed File System), which provides cheap scale-out storage, and MapReduce, the data processing layer that provides a framework for developing analytics applications.

But it’s important to note what Cloudera CEO Mike Olson told InfoWorld recently: “Nobody stands up Hadoop by itself. It’s usually next to a relational database and maybe in service of a document system.” Hadoop tends to sit on the back end of big data solutions, delivering results to other databases or applications. At Evernote, for example, Hadoop connects to a ParAccel MPP analytics system, with JasperReports ultimately delivering insights to Evernote staffers.
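The MapReduce model behind Hadoop is simpler than its reputation suggests: a "map" step turns each raw record into key/value pairs, and a "reduce" step aggregates all values that share a key. Here is a toy sketch in plain Python -- not Hadoop's actual Java API, and the clickstream records are invented for illustration -- that counts page hits the way a MapReduce job might:

```python
from collections import defaultdict

# Toy clickstream: "user_id<TAB>page" lines, as a Web server might log them.
clickstream = [
    "u1\t/home",
    "u2\t/products",
    "u1\t/products",
    "u3\t/home",
]

def map_phase(line):
    """Emit a (page, 1) pair for each click -- the 'map' half of MapReduce."""
    user, page = line.split("\t")
    yield page, 1

def reduce_phase(pairs):
    """Sum the counts for each distinct key -- the 'reduce' half."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

pairs = (kv for line in clickstream for kv in map_phase(line))
print(reduce_phase(pairs))  # {'/home': 2, '/products': 2}
```

In a real Hadoop job, the map and reduce functions run in parallel across many machines, with HDFS holding the input and the framework shuffling the intermediate pairs between them -- but the division of labor is the same.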
While anyone can download and play with a Hadoop distribution, be aware that you’re ultimately going to need to grow or acquire experts to get Hadoop to do the tricks you want it to do. This is a moving target, thanks to a parade of new SQL querying interfaces, prebuilt applications, and Hadoop distributions preintegrated with conventional RDBMS offerings. Soon a lot more people will be able to query vast stores of semistructured data to get the answers they need.

Problem No. 3: Help! I can’t move fast enough!

In the good old days, databases were a lot easier to spec out. If you were a big enterprise scoping out a new order entry system, you probably had a solid idea of how many people would use it, when the peak demand would be, and how frequently (or infrequently) the data model would change.

That was before the “agile” days of the Web. Now companies experiment with all kinds of new applications, many of them public-facing Web apps. Some wither quickly because no one finds them compelling; others may explode in popularity and turn the database into a bottleneck overnight. Moreover, shifts in customer needs, brainstorms for new enhancements, and so on demand a fluid data model. With an RDBMS, data needs to fit into rows and columns, and required fields rule, so a request to alter the data model kicks off an elaborate change management process. NoSQL databases such as Couchbase, MongoDB, or Cassandra, on the other hand, are not intended to enforce rigid data structures, so new data elements can be added on the fly.

As for ramping up capacity, a big part of the NoSQL value proposition is the ability to scale out rather than up. In other words, with NoSQL you can simply add commodity servers as needed.
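That fluid data model is easy to see in miniature. In a document database, each record is a free-form document, so records with different fields can live side by side in one collection, and new fields appear without any schema migration. The sketch below uses plain Python dicts to stand in for documents -- the SKUs, field names, and the `find` helper are all invented for illustration, not a real MongoDB API:

```python
# Each product is a free-form document; no two need share the same fields.
catalog = [
    {"sku": "P-100", "type": "phone", "name": "Acme 5G", "screen_inches": 6.1},
    {"sku": "W-200", "type": "warranty", "name": "2-year extended", "term_months": 24},
    {"sku": "S-300", "type": "plan", "name": "Unlimited Data", "contract_months": 12},
]

def find(collection, **criteria):
    """Query by example: return every document matching all given fields."""
    return [doc for doc in collection
            if all(doc.get(field) == value for field, value in criteria.items())]

# Adding a new attribute later requires no schema migration -- just set it.
catalog[0]["color"] = "graphite"

print(find(catalog, type="phone"))  # the phone document, color field included
```

In a rigid row-and-column schema, those three product types would force either a sparse table full of NULLs or a table per type with joins; here they simply coexist.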
In contrast, with an RDBMS you need to upgrade the horsepower of a single server, and when you need to add more servers, you must “shard” the database across them, which incurs complications of its own.

Aside from extreme scalability, what are NoSQL databases good for? You’ll be amazed to learn that NoSQL database software vendors are inclined to say “almost everything.” In a recent InfoWorld interview, 10gen CEO Dwight Merriman offered a good example: “One telco wrote a product catalog application for their company, a giant company with 100,000 products. Some of them are phones, some of them are extended warranties, and some of them are service plans. They have all these different properties to their products. They found it was very easy to do that with MongoDB because of the way the data model works.”

Shoot first, aim later

Another characteristic of NoSQL databases, including Hadoop in its own way, is that developers tend to like writing to them a lot more than they enjoy working with relational databases. NoSQL lends itself to the shorter dev cycles characteristic of agile development. That yields faster development times on top of breaking the bottleneck of rigid data models.

Data security is still a concern. But all the enterprise-class vendors building on the major open source projects -- Cassandra, Couchbase, Hadoop, MongoDB -- are adding security controls at a furious rate.

Meanwhile, there’s little excuse to sit on the sidelines. Almost all new big data software comes in an open source version; in many organizations, developers are already downloading and experimenting, whether management knows it or not. As for analytics, everyone is accumulating terabytes of semistructured data by default for fear of breaking compliance regulations, so why not derive some insight from that obnoxious quantity of bits? In some instances, IT can benefit directly.
In the case study “Big data drives high performance for Cars.com,” for example, you can see how a major website used the big data tool Splunk to ensure snappy application performance and defend against malicious bots.

The cost of entry is low, and the potential benefits are high. You don’t need to jump into big data with a fully baked strategy; in fact, that would run counter to the whole idea. Wade in, experiment with Hadoop and NoSQL technologies, and see what works. As you build, you’ll discover along the way which investments of time, effort, and money have the potential to pay off most.

This article, “Why you should jump into big data,” originally appeared at InfoWorld.com. Read more of Eric Knorr’s Modernizing IT blog.