Serdar Yegulalp
Senior Writer

Hadoop co-creator: Spark is great — but people want more

News Analysis
Feb 1, 2016 | 4 mins

Doug Cutting anticipates growth ahead and opportunities all around for the Hadoop ecosystem

Ten years after its creation, the Hadoop ecosystem is sprawling and ever-transforming. InfoWorld’s Andy Oliver went as far as to say, “The biggest thing you need to know about Hadoop is that it isn’t Hadoop anymore” — at least, not Hadoop as we once knew it.

Hadoop co-creator Doug Cutting, now with Cloudera, sees all this change as not only a positive development, but as vindication of Hadoop’s open source origins and design.

In a phone conversation with InfoWorld, Cutting noted “having a loose confederation of a lot of open source projects permits evolution at a fundamental level.” In this model, the market determines which components are used.

Over time, individual parts of Hadoop’s ecosystem have grown beyond Hadoop itself. Case in point: Spark, the real-time data-processing framework, has developed an independent following.

However, Cutting believes the rest of Hadoop provides a lot that Spark can’t lay claim to. “Spark is a great execution engine,” he said, “and that’s where we see most Spark adoption, as an execution engine on top of HDFS.” (Spark typically replaces the older MapReduce engine, with YARN or Mesos, sometimes both, as a scheduler.)

But Cutting noted, “There’s a lot of things Spark isn’t.” For instance, it isn’t a full-text search engine; Solr assumes that role in the Hadoop world. One can run SQL queries against Spark, but it isn’t designed to be an interactive query engine; for that, Cutting said, there’s Impala.

“If all you need is streaming programming or batch programming, and you need an execution engine for that, Spark is great. But people want to do more things than that — they want to do interactive SQL, they want to do search, they want to do various sorts of real-time processing involving systems like Kafka…. I think anyone who says ‘Spark is the whole stack’ is doing a necessarily limited number of things.”

Another change over the years — by necessity — concerns security. Because of its origins as an internal tool within Yahoo, Hadoop had no real security to begin with, especially not the finer-grained RBAC-type safeguards required of enterprise-grade products these days. “Folks building Web search engines and such tended to do security-by-firewall,” said Cutting, but he noted that Hadoop’s security is now fine-grained enough to include ACLs for tables or cells within tables.

What does this evolution mean for the protection of data already in the system? “What we’ve seen more often,” said Cutting, “is that folks are required to address [data security] by their organization before they put something into production, before they store the data. That’s been a limiter on what sorts of things people build.” Now that Hadoop has more security features, he said, “it can be used in more places.”

Cutting mentioned two other limiters for Hadoop adoption: the skill sets of the users, and the rates at which enterprises build new systems. “Not everybody is up to speed on the tools,” he said of the former, and of the latter, “[Enterprises] mostly run existing systems; they don’t rewrite everything every year, so those things take time as well.”

Despite these obstacles, Cutting remains confident that the constant activity within the Hadoop ecosystem will keep it healthy. The Kudu storage engine, developed by Cloudera to merge features from HBase and HDFS, “shows how the ecosystem can evolve.”

Though it’s still technically alpha, some of Cloudera’s customers are already using it in production. Cutting also remarked that Kudu has been integrated into other Hadoop engines, including Apache Drill (which isn’t included in Cloudera’s distribution).

“That other people have been voting to embrace [Kudu] is a real vote that it’s something of interest,” said Cutting.
