Serdar Yegulalp
Senior Writer

Apache Spark powers live SQL analytics in SnappyData

news analysis
Apr 27, 2016

The same team that created GemFire builds on Spark in a new open source database that can handle OLTP and OLAP workloads side by side

The team behind Pivotal’s GemFire in-memory transactional data store recently unveiled a new database solution powered by GemFire and Apache Spark, called SnappyData.

SnappyData is another recent example of Spark employed as a component in a larger database solution, with or without other pieces from Apache Hadoop.

Snap and spark

SnappyData, the name of both the new database and the organization producing it, was built to span two worlds. It uses the Apache Spark in-memory data-analytics engine to perform live SQL analytics on both static data sets and streams. Queries against SnappyData can be written as conventional SQL or as Spark abstractions, so existing work done in either paradigm can be reused, alone or together, on the same data.

To store and retrieve the data, SnappyData has a distributed data store called Snappy-Store, derived from a variant of Pivotal’s GemFire technology. It works either as its own data store or as a sort of asynchronous write-back cache to other data sources, such as Hadoop/HDFS. This implies that existing data sets could be accessed through SnappyData without having to be formally migrated.
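To make the write-back idea concrete, here is a minimal Python sketch of the general pattern, not SnappyData's or GemFire's actual implementation. The `WriteBackCache` class and its method names are hypothetical, and a plain dictionary stands in for the slower backing source such as HDFS. Writes land in memory first and reach the backing store only when `flush()` is called; a real system would typically run that step on a background thread.

```python
class WriteBackCache:
    """Hypothetical write-back cache over a dict-like backing store."""

    def __init__(self, backing_store):
        self.backing_store = backing_store  # stand-in for HDFS or another source
        self.cache = {}                     # in-memory copies of values
        self.dirty = set()                  # keys changed since the last flush

    def get(self, key):
        # Serve from memory; load from the backing store on a miss.
        if key not in self.cache:
            self.cache[key] = self.backing_store[key]
        return self.cache[key]

    def put(self, key, value):
        # Update memory only and remember that the key needs writing back.
        self.cache[key] = value
        self.dirty.add(key)

    def flush(self):
        # Deferred write-back: push every dirty key to the backing store.
        for key in self.dirty:
            self.backing_store[key] = self.cache[key]
        self.dirty.clear()


store = {"orders": 10}
cache = WriteBackCache(store)
cache.put("orders", 11)        # the backing store still holds 10 here
latest = cache.get("orders")   # reads see the new value immediately
cache.flush()                  # now the backing store is updated
```

Because reads fall through to the backing store on a miss, data already sitting in that store can be queried without first copying it wholesale into the cache.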

SnappyData also tries to offer novel solutions to problems that can arise when using streaming data. For instance, if too much data is coming through for a query to be answered in real time, SnappyData uses approximate query processing (AQP), a technique that samples the streaming data to generate an answer.

The results are less exact than those obtained from the entire data set, and AQP isn’t available for every kind of query. That said, AQP queries are intended to run faster and to demand less CPU and memory than queries against the full data set.
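The trade-off can be illustrated with a small Python sketch. It is a generic example of sampling-based approximation, not SnappyData's AQP engine: it uses reservoir sampling, a standard way to keep a fixed-size uniform sample of a stream, and the function names, sample size, and seed are assumptions chosen for illustration.

```python
import random


def reservoir_sample(stream, k, rng):
    """Keep a uniform random sample of up to k items from a stream of unknown length."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)       # fill the reservoir first
        else:
            j = rng.randint(0, i)     # inclusive on both ends
            if j < k:
                sample[j] = item      # replace with probability k / (i + 1)
    return sample


def approximate_mean(stream, k, seed=0):
    """Estimate the mean of a stream from a k-item sample instead of every value."""
    sample = reservoir_sample(stream, k, random.Random(seed))
    if not sample:
        raise ValueError("cannot estimate the mean of an empty stream")
    return sum(sample) / len(sample)


values = range(1, 100_001)
exact = sum(values) / len(values)                  # touches all 100,000 values
approx = approximate_mean(iter(values), k=1_000)   # keeps only 1,000 in memory
print(f"exact={exact:.1f} approx={approx:.1f}")
```

The estimate holds only 1,000 values in memory at a time, at the cost of a small error relative to the exact mean. Aggregates such as averages and counts tolerate this kind of sampling well, which is consistent with AQP not being available for every kind of query.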

One among many

This isn’t the first time Spark has been used at the heart of a data analysis solution that covers both OLTP and OLAP workloads. In-memory database system Splice Machine was originally built on top of Hadoop components and used them to scale out and run both OLTP and OLAP jobs under one hood. Version 2.0 of that product added Spark as an OLAP processing engine.

Where SnappyData diverges from Splice Machine, though, is in how Spark is used. SnappyData claims it’s extending Spark Streaming in various ways, such as allowing streams to be manipulated and queried as though they were tables, including with operations like joins.

SnappyData also seems like a good environment for taking advantage of changes that are slated for Apache Spark in the near term. For instance, Spark 2.0, scheduled to come out later this year, will heavily rework how Spark handles memory management and introduce changes to its streaming system that make it easier to pull down streaming data.


Serdar Yegulalp is a senior writer at InfoWorld. A veteran technology journalist, Serdar has been writing about computers, operating systems, databases, programming, and other information technology topics for 30 years. Before joining InfoWorld in 2013, Serdar wrote for Windows Magazine, InformationWeek, Byte, and a slew of other publications. At InfoWorld, Serdar has covered software development, devops, containerization, machine learning, and artificial intelligence, winning several B2B journalism awards including a 2024 Neal Award and a 2025 Azbee Award for best instructional content and best how-to article, respectively. He currently focuses on software development tools and technologies and major programming languages including Python, Rust, Go, Zig, and Wasm. Tune into his weekly Dev with Serdar videos for programming tips and techniques and close looks at programming libraries and tools.
