Serdar Yegulalp
Senior Writer

Big data brawlers: 4 challengers to Spark

news analysis
Aug 1, 2016 (3 mins)

Spark isn't the only option for handling big data at scale and in memory. These four contenders take on both stream processing and batch jobs

Big (and even not so big) data hasn’t been the same since Apache Spark made inroads with developers and became a staple ingredient in big data clouds.

But Spark is far from perfect. It’s certainly improving, as version 2.0 shows, but if a competitor offers a better handle on what Spark does and more, developers will pay attention.

Here are four projects emerging as possible competition for Spark, with new approaches to handling the conventional in-memory batch processing Spark is famous for and the streaming Spark continues to work on.

Apache Apex

What it is: Originally created by DataTorrent, Apex has since been donated to the Apache Software Foundation. It performs both stream and batch processing on Hadoop under YARN.

How it competes: Apex’s streaming is the real deal, while Spark’s “streaming” is actually a microbatch system. Apex also has native support for fault tolerance by way of Hadoop, though that means Apex and Hadoop are tightly coupled, whereas Spark can work with or without Hadoop. Apex also doesn’t yet have Spark’s machine learning features.
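To make the streaming-versus-microbatch distinction concrete, here is a minimal sketch in plain Python. It uses neither the Apex nor the Spark API; the function names `per_event` and `micro_batch` are hypothetical, and the batches are cut by event count for simplicity, whereas a microbatch engine typically cuts them by time interval.

```python
from typing import Iterable, Iterator, List


def per_event(events: Iterable[int]) -> Iterator[int]:
    """Event-at-a-time streaming: emit an updated total for every event."""
    total = 0
    for value in events:
        total += value
        yield total  # one output per input event


def micro_batch(events: Iterable[int], batch_size: int) -> Iterator[int]:
    """Microbatching: buffer events, then process each buffer as a small batch."""
    total = 0
    batch: List[int] = []
    for value in events:
        batch.append(value)
        if len(batch) == batch_size:
            total += sum(batch)
            yield total  # one output per full batch
            batch = []
    if batch:  # flush the final, partially filled batch
        total += sum(batch)
        yield total


events = [1, 2, 3, 4, 5]
print(list(per_event(events)))       # [1, 3, 6, 10, 15]
print(list(micro_batch(events, 2)))  # [3, 10, 15]
```

Both versions arrive at the same final total, but the microbatch version reports results only when a batch closes, which is where the extra latency comes from.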

Heron

What it is: Twitter’s replacement for the Apache Storm stream-processing framework, Heron is now available as an open source project. Consider this a contender for Spark streaming.

How it competes: Heron runs streaming jobs via containers managed through a scheduler. To that end, it not only scales more readily than other solutions, but is easier to debug, deploy, and keep running well on clusters. It’s also designed to appeal to existing Storm users, since it’s compatible with the Storm API and shares many of Storm’s concepts (such as “spouts” and “bolts”).
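For readers new to Storm's vocabulary, a spout is a source that emits data into a topology, and a bolt is a processing step that consumes it. The sketch below mimics that shape with plain Python generators. It does not use the Storm or Heron API, and the names `sentence_spout`, `split_bolt`, and `count_bolt` are hypothetical.

```python
from collections import Counter
from typing import Iterable, Iterator


def sentence_spout() -> Iterator[str]:
    """Spout: a source that emits items into the pipeline."""
    yield from ["spark is fast", "heron is fast"]


def split_bolt(sentences: Iterable[str]) -> Iterator[str]:
    """Bolt: consumes sentences and emits individual words."""
    for sentence in sentences:
        yield from sentence.split()


def count_bolt(words: Iterable[str]) -> Counter:
    """Bolt: consumes words and keeps a running count of each."""
    counts: Counter = Counter()
    for word in words:
        counts[word] += 1
    return counts


# Wire the spout into the two bolts, as a topology would.
counts = count_bolt(split_bolt(sentence_spout()))
print(counts["fast"])  # 2
```

In a real cluster each spout and bolt would run as many parallel instances across machines; this single-process chain only shows how the pieces connect.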

Apache Flink

What it is: Apache Flink is a stream-processing library that competes with Apache Storm as much as with Spark.

How it competes: Like Apex, Flink puts streaming first, and it uses a true streaming model rather than Spark’s streaming via microbatch. Flink also has provisions for performing iterative or repeated processing on streams, and it includes Spark-like features, such as machine learning and graph processing libraries. But Flink is still a relatively new project, having hit 1.0 earlier this year.

Onyx

What it is: Onyx is a “masterless, cloud scale, fault tolerant, high performance distributed computation system,” according to its documentation, with both batch and stream processing models.

How it competes: Written in the functional language Clojure rather than Scala, Onyx puts streaming first — batch operations are basically implemented as ministreams. Onyx also allows the developer to use language constructs in either Clojure or Java, such as Clojure’s vectors and maps, to define how data is processed. (Many of Onyx’s goals were laid down before the code was even created.) If Onyx catches on, it’ll most likely be due to Java’s existing popularity rather than Clojure’s expressiveness.
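The idea of describing a job with ordinary data structures can be illustrated loosely in Python, with a list of pairs standing in for a Clojure vector and a dict standing in for a map. This is an analogy only, not Onyx's API: the `workflow` and `tasks` layout and the `run` helper are assumptions made for the example, and the helper handles only a single linear chain from "in" to "out".

```python
from typing import Callable, Dict, List, Tuple

# The job is plain data: a list of (from, to) edges...
workflow: List[Tuple[str, str]] = [
    ("in", "double"),
    ("double", "increment"),
    ("increment", "out"),
]

# ...plus a map from task name to the function that task applies.
tasks: Dict[str, Callable[[int], int]] = {
    "double": lambda x: x * 2,
    "increment": lambda x: x + 1,
}


def run(workflow: List[Tuple[str, str]],
        tasks: Dict[str, Callable[[int], int]],
        segments: List[int]) -> List[int]:
    """Push each input through the chain of tasks described by the workflow."""
    next_step = dict(workflow)  # assumes one outgoing edge per step
    results = []
    for segment in segments:
        step = next_step["in"]
        while step != "out":
            segment = tasks[step](segment)
            step = next_step[step]
        results.append(segment)
    return results


print(run(workflow, tasks, [1, 2, 3]))  # [3, 5, 7]
```

Because the job description is just data, it can be built, inspected, or rewritten by other code before it ever runs.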

Serdar Yegulalp is a senior writer at InfoWorld. A veteran technology journalist, Serdar has been writing about computers, operating systems, databases, programming, and other information technology topics for 30 years. Before joining InfoWorld in 2013, Serdar wrote for Windows Magazine, InformationWeek, Byte, and a slew of other publications. At InfoWorld, Serdar has covered software development, devops, containerization, machine learning, and artificial intelligence, winning several B2B journalism awards including a 2024 Neal Award and a 2025 Azbee Award for best instructional content and best how-to article, respectively. He currently focuses on software development tools and technologies and major programming languages including Python, Rust, Go, Zig, and Wasm. Tune into his weekly Dev with Serdar videos for programming tips and techniques and close looks at programming libraries and tools.
