Serdar Yegulalp
Senior Writer

Spark 1.6 feeds big data’s hunger for memory

news analysis
Nov 23, 2015 · 3 mins

The forthcoming version of the popular data processing framework makes more efficient use of memory, addressing a common developer complaint

Those curious about what’s coming in Apache Spark 1.6 can now get hands-on experience with the latest changes to the big data processing framework.

Databricks, a major commercial contributor to Spark and provider of a cloud-hosted platform for running Spark applications, is offering Spark 1.6 in preview on its services. Those who want to try out Spark on their own can obtain Spark 1.6 pre-release code directly from Apache.

More headroom

This new version packs major changes to the way Spark handles memory. Earlier editions of Spark subdivided available memory into two fixed partitions, one for data storage and one for execution, and required users to decide how to split memory between the two. This may have been one of the reasons for InfoWorld contributor Ian Pointer’s complaints about Spark’s memory management.

In version 1.6, execution memory and storage memory can borrow from each other as needed. “For many applications,” says Databricks in its blog post, “this will mean a significant increase in available memory that can be used for operators, such as joins and aggregations.”

There are still some restrictions in the current implementation; while borrowed execution memory can be released if needed, borrowed storage memory is never released. For backward compatibility, 1.6 also includes a legacy, fixed-partition memory management mode.
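As a sketch of how this looks in practice: the unified model is tuned with the `spark.memory.fraction` and `spark.memory.storageFraction` settings, while the legacy fixed-partition mode can be switched back on with `spark.memory.useLegacyMode`, which restores the old per-region knobs. The values below are Spark 1.6's documented defaults, shown here in `spark-defaults.conf` form for illustration.

```properties
# Unified memory management (Spark 1.6 default mode).
# Fraction of heap used for execution and storage combined:
spark.memory.fraction          0.75
# Portion of that region shielded from eviction for storage:
spark.memory.storageFraction   0.5

# Opt back in to the pre-1.6 fixed partitions:
# spark.memory.useLegacyMode     true
# spark.storage.memoryFraction   0.6
# spark.shuffle.memoryFraction   0.2
```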

Version 1.6 also continues 1.5’s work on Project Tungsten, a major initiative to rewrite Spark’s low-level handling of memory and CPU. As Spark picks up speed, both in terms of its performance and its popularity with developers, more of its performance bottlenecks are turning out to be limits of the JVM itself, such as its garbage collector.

Keeping it convenient

Another key addition to Spark 1.6 is the Dataset API for working with collections of typed objects. Previously, Spark users had a choice of two APIs for interacting with data: RDDs, essentially collections of native Java objects (useful, but slow); and DataFrames, collections of structured binary data (less safe, but faster).

Datasets meld the advantages of both: the compile-time type checking and user-defined functions of RDDs, and the superior performance of DataFrames. Datasets and DataFrames are also meant to interoperate freely; libraries that accept one kind of data can accept the other as well.
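A minimal Scala sketch of the new API, assuming a Spark 1.6 shell where a `SQLContext` named `sqlContext` is already in scope (the `Person` class and its values are hypothetical):

```scala
import org.apache.spark.sql.{DataFrame, Dataset}
import sqlContext.implicits._

case class Person(name: String, age: Long)

// A Dataset is a typed collection: field names and lambdas are
// checked by the compiler, as with RDDs.
val people: Dataset[Person] = Seq(Person("Ann", 34), Person("Bo", 25)).toDS()

// RDD-style functional operators over Tungsten's binary representation:
val adults = people.filter(_.age >= 30)

// Datasets and DataFrames convert freely in both directions:
val df: DataFrame = people.toDF()
val backAgain: Dataset[Person] = df.as[Person]
```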

Spark’s real-time data component, Spark Streaming, also gets a boost from a change to its state-tracking API. Programs that track large amounts of state across a stream (user sessions, for instance) now pay a memory cost proportional to the changes in that state, rather than to its total size. Databricks claims this can provide as much as a tenfold improvement in some workloads, and since Spark Streaming has had its share of controversy over performance, any improvement will likely be welcomed.
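The new API surfaces as `mapWithState` on key-value DStreams. A hedged sketch, assuming `events` is an existing `DStream[(String, Int)]` of per-user event counts (the stream and the session-counting logic are hypothetical); only keys present in each batch are touched, which is why cost tracks the delta rather than the full state:

```scala
import org.apache.spark.streaming.{State, StateSpec}

// Running event count per user, kept as managed streaming state.
val updateCount = (userId: String, event: Option[Int], state: State[Long]) => {
  val newCount = state.getOption.getOrElse(0L) + event.getOrElse(0)
  state.update(newCount)
  (userId, newCount)   // emitted downstream for each update
}

val counts = events.mapWithState(StateSpec.function(updateCount))
```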

Other changes aren’t as major but are still useful: persistent pipelines in Spark ML (so jobs can be suspended and resumed), the ability to run Spark SQL queries directly on flat files in supported formats, some new machine learning algorithms, and improvements to the R and Python APIs. The last may help widen Spark’s reach; while Python remains a prime language of choice for science and math, Spark’s API support for the language has traditionally lagged.
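Two of those additions can be sketched briefly, again assuming a `sqlContext` and a fitted ML `pipeline` are in scope; the paths here are hypothetical:

```scala
// Query a flat file in a supported format directly in SQL,
// with no table registration step:
val events = sqlContext.sql("SELECT * FROM parquet.`/data/events.parquet`")

// Persist an ML pipeline to disk and reload it later:
pipeline.save("/models/session-pipeline")
val restored = org.apache.spark.ml.Pipeline.load("/models/session-pipeline")
```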


Serdar Yegulalp is a senior writer at InfoWorld. A veteran technology journalist, Serdar has been writing about computers, operating systems, databases, programming, and other information technology topics for 30 years. Before joining InfoWorld in 2013, Serdar wrote for Windows Magazine, InformationWeek, Byte, and a slew of other publications. At InfoWorld, Serdar has covered software development, devops, containerization, machine learning, and artificial intelligence, winning several B2B journalism awards including a 2024 Neal Award and a 2025 Azbee Award for best instructional content and best how-to article, respectively. He currently focuses on software development tools and technologies and major programming languages including Python, Rust, Go, Zig, and Wasm. Tune into his weekly Dev with Serdar videos for programming tips and techniques and close looks at programming libraries and tools.
