A conversation with IBM’s Mr. Big Data

Rod Smith, IBM's vice president of emerging Internet technologies, tells InfoWorld about IBM's exploits in Big Data -- this year's hottest trend

Rod Smith has one of the most enviable titles around: vice president of emerging Internet technologies. He earned it. My first encounter with him goes back to the early days of SaaS (software as a service) when he was IBM’s point man on the topic. But he is probably best known for his key role in the development of IBM’s WebSphere line of middleware, as well as for his early advocacy of XML, Web services, and J2EE.

Last week, the day after IBM’s 100th anniversary celebration, I caught up with Smith at the Strata conference on “big data” — that is, the huge globs of unstructured data generated by Web clickstreams, system and security logs, distributed sensors, truckloads of text, and just about anything else you can name.

Teasing value from data once considered too amorphous to exploit is Smith’s current obsession — not surprising, since this is one of the most exciting areas of emerging technology. Smith leads strategy and planning for IBM’s Big Data practice, including IBM InfoSphere BigInsights, a collection of analytics and visualization technologies centering on Hadoop. I began our conversation by asking Smith about the origins of his involvement with Big Data.

Eric Knorr: When did you first encounter Big Data? My guess is that it was before it was called that.

Rod Smith: It was. When we went to customers and talked about just processing data, they kept saying, “Databases, we know what we know about them, but there’s data out there that we think has value — but we don’t know. We think it has insights for us. But we don’t want to pick it up and put it in a database with all the management costs that go with that, and then find it doesn’t mean anything. So we need something we can use to discover insights quickly — or not.”

It’s kind of like a cycle of exploration, but traditional handling of data doesn’t do that. You go through the process of bringing it in and cleansing it and normalizing it. But they said, “That’s not what we want. We don’t know if data from Twitter is going to be valuable until we see something there that makes us go, ‘ah ha, now we know what we can do with it!’”

One of the first customers that asked for a proof of concept was the BBC. They had an effort called Digital Democracy, and they were looking at how they could help journalists be much more efficient writing in-depth articles. It takes a long time to really sift through information. So I said, “That’s interesting.” We didn’t know what they wanted us to do yet. So they said, “We’re not quite ready to get our information from our side of it, but could you go out and read in all the Parliament information and then tell us what Parliament members were interested in what bills, what bills were getting buzzed, who was working on them, how long they’d been working on them?” And they gave us a list of interesting questions. And so that’s where we started, and that’s Big Data. Not necessarily in the terabyte sense, but in the sense of cost-intensive people trying to work with it.

Knorr: And it’s unstructured.

Smith: And it’s unstructured; or semi-structured, as people call it. But we like the term “big data” because data folks have been forced to define different types of data, as opposed to the business person who just says, “I don’t care if it’s structured or unstructured or whatever, I just want to get this information from it. And you confuse me by telling me how it’s done. I don’t know the how. I don’t care. I just want to get these insights from it.” And that was really how we got started running these things and using Many Eyes, in the BBC case, to do the visualizations.

Knorr: And was this using MapReduce techniques?

Smith: Yes, this was all Hadoop-based. But then we heard back from customers, like the BBC. The first try of giving an application to an end-user wasn’t very good. But they gave us lots of feedback, such as, “Here’s what we had in mind. Think about it maybe as a spreadsheet more. We know spreadsheets. How close could you get to doing that?” That’s the type of iteration you go through with customers when you’re not part of a product team, because if they don’t like it they’re very forward with it and lay it out. And that’s what they want because they know that time is valuable, they appreciate our talking to them, they appreciate the kind of insights we bring, but if it’s not going to work, they don’t want us to lose time.

Knorr: So how would you put Big Data in context with the larger sphere of business intelligence? Obviously it’s unstructured versus structured SQL data, but what about in terms of applications? To me, it seems there’s a higher failure rate in business intelligence projects than in other IT projects.

Smith: I’ll give you some interesting facts. IBM put out a CIO study two years ago, and business intelligence was the No. 1 thing CIOs cared about. Virtualization second, mobile third. Now, you’d think the CIO would make every effort to get his data into the hands of the BI experts. The opposite was true. We asked them, “Do you make your data easily available for the lines of business?” It was like 12 percent. So the poor guy at the top of the business who says, “I should be able to use BI to get good answers,” doesn’t realize the IT department is not really helping them. So the poor BI folks are trying to extract enough information, but in many cases it’s kind of on a shoestring.

That’s No. 1. I think the part that you’re seeing around Big Data is you’d like to … discover repeatable business patterns. Then you want to go to BI or Cognos and say, “Ah, now how do I model this?” And that up-front preparation, without the new technologies we’re talking about, can mean a long cycle. You have to sit with the customer and figure out what they’re doing and figure out the data, and it takes a lot of time. So I think what Big Data starts to offer us is a discovery dimension. I’d like to be able to discover patterns and actionable insights, so I can now turn to my experts in business analytics and business intelligence and say, “I can now describe this better to you. I can tell you which data is going to be important. I can tell you how I’m thinking about how you should figure your models or build your models around this.”

Knorr: So you see it as an exploratory tool.

Smith: I think it’s an exploratory tool that we haven’t had before that adds to this whole area of business analytics/business intelligence. It’s been very labor-intensive up front, and now we’re able, with technologies like BigInsights, to answer the question of how do we help people sift through data where maybe 90 percent of it is not very useful?

Knorr: Well, and you don’t have to worry about the usual cleansing and coherency and all that other stuff because by nature it’s dirty, it’s all over the place.

Smith: And in many cases, when data isn’t clean, it tells you more stories about itself. We did a proof of concept with a customer, and then figured out how we could do it live with real data when the iPhone 4 came out. So we went to Twitter, and over 36 hours we collected 375,000 tweets. And then we used sentiment analysis to go through and find only those people who were interested in certain phones and whether they were interested in buying; you look for certain words for that.

And then we did a tag cloud. You look and say, “OK, which phones were more popular?”

So what happens with dirty data? Well, in the tag cloud you saw Android — and Android misspelled. But if you had cleaned the data, you might have removed that entry and the weight factor wouldn’t have been the same. But now that you see it, you can go back and correct it and get a more realistic view. Traditional data folks say cleansing is important, and I agree, but I don’t want to cleanse out the context. And context is going to be the shell around this that allows you then to say, “Yes, now I know how I want to work!” — and prep the data for other types of intelligence.
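The effect Smith describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not IBM's method: the sample tweets, the intent word list, and the helper names `intent_term_counts` and `merge_variants` are all hypothetical, and simple keyword matching stands in for whatever analysis was actually run. It shows how a misspelled term appears as its own entry in the counts behind a tag cloud, and how folding it into the correct spelling changes the weight.

```python
import re
from collections import Counter

# Hypothetical sample tweets; the interview does not reproduce any collected data.
TWEETS = [
    "Thinking about buying an Android phone this weekend",
    "I want to buy the iPhone 4 today",
    "Should I purchase an Andriod or wait?",
    "Just had lunch, nothing to report",
    "My android battery died again",
]

# Assumed intent words; the interview only says "certain words" were used.
INTENT_WORDS = {"buy", "buying", "purchase", "purchasing"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def intent_term_counts(tweets, intent_words=INTENT_WORDS):
    """Count terms, but only in tweets that contain a purchase-intent word."""
    counts = Counter()
    for tweet in tweets:
        tokens = tokenize(tweet)
        if intent_words.intersection(tokens):
            counts.update(t for t in tokens if t not in intent_words)
    return counts


def merge_variants(counts, variants):
    """Fold known misspellings into their canonical term, keeping their weight."""
    merged = Counter()
    for term, n in counts.items():
        merged[variants.get(term, term)] += n
    return merged


raw = intent_term_counts(TWEETS)
fixed = merge_variants(raw, {"andriod": "android"})
print(raw["android"], raw["andriod"])  # the misspelling is visible as its own entry
print(fixed["android"])                # corrected weight after merging
```

Dropping the misspelled entry during cleansing would have lowered the term's weight; keeping it visible and merging it afterward preserves the signal.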

Knorr: Well, is that the only role for Big Data, as the front end of a longer, more conventional business intelligence process?

Smith: I don’t think so. For example, there’s the idea that you can … take more than one data source and put it together very quickly. If the database has to join, that’s a lot of work. For us it’s putting it into a distributed file system; then we know how to go through and read it at that point. So there’s combining data, there’s sifting through the data, working with a very broad array, depending on where the customer wants to go. So it will build up the analytics more interpreted — rather than fixed, which costs a lot to change.

Knorr: The scenarios you always hear about are Yahoo and Facebook. Clickstream data — terabytes of it. Never having seen these applications, I’m thinking that maybe there’s some sort of visualization that emerges from a Hadoop process that you can just use in and of itself and say, “We’d better change our stupid security stuff” on Facebook or whatever.

Smith: I’ll give you an example. You have to put data in the context of business patterns. So a person who is a chief legal officer does mergers and acquisitions. One such person asked us, “Could you read in all the patent office information?” So if you wanted to do that in a database, you could, but you’d have a lot of stuff to manage and keep around as opposed to what we actually did, which was read it in. Gee, then look just for the patents, and maybe one percent is that company’s patents. We ranked them according to how many people were referencing them. And the customer says, “That’s interesting. That gives some weight value to it.”

Only a couple of patents turned out to be referenced more than once. And then the customer says, “You know what? Could you pull in Federal 9th District Court information so I could see if anything is being litigated around these?” But it’s much more of an “ah ha!” I mean, “If you can do that, then can you go get that other piece of information? I’d like to interpret more information.” So it’s much more of a collaboration with a line of business to determine what they’re after, because they don’t know until they see something.
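The ranking step in the patent example can be sketched as follows. This is a minimal illustration, assuming each patent record carries an assignee and a list of the patents it cites; the record layout, the sample data, and the function name `rank_by_references` are hypothetical and not taken from the interview.

```python
from collections import Counter

# Hypothetical patent records: "cites" lists the patent ids a record references.
PATENTS = [
    {"id": "P1", "assignee": "Acme", "cites": []},
    {"id": "P2", "assignee": "Acme", "cites": ["P1"]},
    {"id": "P3", "assignee": "Other", "cites": ["P1", "P2"]},
    {"id": "P4", "assignee": "Other", "cites": ["P1"]},
    {"id": "P5", "assignee": "Acme", "cites": []},
]


def rank_by_references(patents, assignee):
    """Rank one company's patents by how many other patents reference them."""
    # Tally how often each patent id appears in any record's citation list.
    cited = Counter(c for p in patents for c in p["cites"])
    # Keep only the patents owned by the company of interest.
    owned = [p["id"] for p in patents if p["assignee"] == assignee]
    # Most-referenced first; ties are broken by patent id.
    return sorted(((pid, cited[pid]) for pid in owned),
                  key=lambda pair: (-pair[1], pair[0]))


ranking = rank_by_references(PATENTS, "Acme")
print(ranking)
```

The same pass over the raw records answers both questions at once: which patents belong to the company, and which of those carry weight because others reference them.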

Knorr: It sounds like a different mindset.

Smith: It’s very different. What most folks have been used to is you ask IT to do it, they go off for a long period of time, and they come back with it. And not to be unfair to them — they’re trying to think about it in the context of other IT applications — but the end result might have no value to me at all. I’d like to know quickly if it’s going to be good or bad, not have you build it like it is going to be good, and I look like an idiot because it’s not very good.

Knorr: Put Big Data in the context of IBM’s business intelligence acquisitions. You’ve got Cognos, you’ve got Netezza, you’ve got SPSS — there have been a bunch of acquisitions in the past several years.

Smith: That’s for sure. I think we’ll see Big Data as a resource for all those types of solutions. It depends on where the customer is in their business cycle, if you will.

Let’s say a customer is in the discovery phase. Then things like BigInsights that we’re doing help them with that. But then they find that it’s a good repeatable pattern; they’d like to export that data they sifted down into Cognos or into SPSS. And SPSS and Cognos can do more of the modeling around those things at that point.

As for Netezza, you can think of it as the appliance that you can put BigInsights on and really crank up the processing. It’s really solving business problems that you wouldn’t have thought were traditional analytics or intelligence. And I think that’s the part that we like: How can we help you look at data early on and get some insights on how to change your business? And then how do we help you with Cognos or SPSS or other things to work through the different stages of that?

Knorr: And when you say, “How do we help you?,” do you anticipate using this as a tool in, say, consulting engagements or applications?

Smith: I think it’ll be services, as well as what we’re going to do from a software standpoint. But I think both of them are important topics.

Knorr: Can you talk about firm plans for IBM applications?

Smith: We try to build applications around what customers have asked us for. It’s fascinating how these domain experts want to use very large data almost in a personal computing manner. It’s not on their machine at the desktop, it’s someplace over there, but they want to manipulate that data like it was right there at their fingertips. I’ll mention one specific application, which is what we’re going to do with Watson on “Jeopardy.”

Knorr: Oh, yes. But how does Big Data play into that?

Smith: Well, they use a number of technologies as the foundation, one of them being Hadoop. They read in (what is it?) a million books, 200 million pages. And then they have specialized analytics that go through and do the analysis and categorize things, very much on the text and natural language part of it, machine learning. And then they come out with an application.

Now, in two weeks, when Watson is on “Jeopardy,” you’re not going to hear about the technology underneath it. Nobody except us is going to care about that. But that’s the cool thing, in my mind. What’s the technology that makes this work differently than I’ve seen before? I’m just going to take for granted that these new technologies like Hadoop and others are enabling that application you would not have thought about before.

Knorr: Give me some other examples.

Smith: Well, medical. Maybe you want to come back with a half a dozen answers that look feasible. Just like with Watson, you’d like to trace back exactly how it came to those answers so you could understand how it arrived there. We’ve done some things with doctors and BigInsights. One of their points is, “As you create data and process it, I’d like to see every step, backwards and forwards, because another researcher might look and say — oh, you took this assumption and went there. I might want to take that and go this way with it.”

And I think that’s the value of this, really, is that it puts it in the hands of more of these professionals. Because otherwise it goes to IT, IT writes it, goes back and says, “What do you think?”

“Not quite what I had in mind.” It adds development costs, it adds all the other standardized types of things, without any new value added to the business.

Knorr: Well, that poses an interesting challenge, doesn’t it? Because right now, in most cases, there’s some sort of BI specialist who actually stands between the technology and the end-user. And usually that’s a good thing, or they’re going to end up asking the wrong questions and drawing the wrong conclusions. The domain areas that you’re exploring are incredibly diverse. It’s not simply a narrow business context; it can be all sorts of contexts. So to find these patterns, the interaction of the end-user or the consumer of the data should be much more direct than in more normal business intelligence.

Smith: I think it should be, and it’s also richer for everybody, because otherwise it’s just like compartmentalizing someone and saying, “Please look at this from a BI perspective.” As opposed to, “You know, here’s my hunch on what might be happening. Now let’s go do that. And now, did my hunch pay off or not?” And so now a BA or BI expert is like, “Ah! Maybe I know where to go look for part of these things, too.”

I think some of these cases will show you that using Hadoop and using BigInsights, I can get you answers in a couple of minutes. So it feels like my cost is much smaller. That’s very interesting. I like that. As opposed to more traditional IT- or BI-type things, which take some time. So I think this opens the aperture for better collaborations, richer collaborations, with business analysts and business professionals. And I think the term “analytics” is too narrow. It just doesn’t add up to the business problems we’re seeing out there.

Knorr: But this kind of circles back again to the idea of new types of applications that can allow this sort of correct interaction.

Smith: Yep. And this is why Watson is an interesting one. It’s a new type of application of what we’re trying to do.

Knorr: When can we expect to see new products coming out, and what part of IBM will they be coming out of?

Smith: Well, I can’t talk about the other ones yet. But I think you’ll see that Big Data is going to be kind of a horizontal component to a lot of what we do. You mentioned log files, and Tivoli does log files. There are different domains in here where Big Data touches us, and we’d like to be in a position of offering those kinds of capabilities for a broad range of our solutions.

Knorr: Great. Thank you very much.

Smith: Thank you.

This article, “A conversation with IBM’s Mr. Big Data,” originally appeared at InfoWorld.com. Read more of Eric Knorr’s Modernizing IT blog, and for the latest business technology news, follow InfoWorld on Twitter.
