Delta Lake gives Apache Spark data sets new powers


Apr 24, 2019 | 2 mins

A new open source project from Databricks adds ACID transactions, versioning, and schema enforcement to Spark data sources that don't have them

Databricks, the company founded by the original developers of Apache Spark, has released Delta Lake, an open source storage layer for Spark that provides ACID transactions and other data-management functions for machine learning and other big data work.

Many kinds of data work need features like ACID transactions or schema enforcement for consistency, metadata management for security, and the ability to work with discrete versions of data. Not every data source provides those features, so Delta Lake supplies them for any Spark DataFrame data source.

Delta Lake can be used as a drop-in replacement for accessing storage systems like HDFS. Data ingested into Spark through Delta Lake is stored in Parquet format in a cloud storage service of your choice. Developers can use their choice of Java, Python, or Scala to access Delta Lake’s API set.
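As a rough illustration of what "drop-in" means in practice, the Python sketch below writes and reads a DataFrame using the format name `delta` where a Parquet job would use `parquet`. It assumes a Spark session that already has the Delta Lake package loaded. The helper names `write_delta` and `read_delta` are hypothetical and not part of Delta Lake's API.

```python
def write_delta(df, path, mode="append"):
    """Save a Spark DataFrame as a Delta table at `path`.

    `df` is assumed to be a pyspark.sql.DataFrame. The call chain is the
    usual DataFrameWriter one; only the format name differs from a
    Parquet write.
    """
    df.write.format("delta").mode(mode).save(path)


def read_delta(spark, path):
    """Load the current version of the Delta table at `path`.

    `spark` is assumed to be an existing SparkSession that has the
    Delta Lake package available.
    """
    return spark.read.format("delta").load(path)
```

With a session in hand, `write_delta(spark.range(5), "/tmp/events")` followed by `read_delta(spark, "/tmp/events")` would write a small table and read it back. The path here is only a placeholder.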

Delta Lake supports most of the existing Spark SQL DataFrame functions for reading and writing data. It also supports Spark Structured Streaming as a source or destination, although not the DStream API. Every read and write through Delta Lake has an ACID transaction guarantee, so that multiple writers will have their writes serialized and multiple readers will see consistent snapshots.
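One common way to get that kind of guarantee is optimistic concurrency over an ordered log of commits: each writer commits against the version it read, and loses if another commit got there first. The article does not describe Delta Lake's internals, so the Python below is a toy, in-memory model of the idea rather than Delta Lake's implementation. `ToyCommitLog` and `append_with_retry` are hypothetical names.

```python
import threading


class ToyCommitLog:
    """Conceptual stand-in for a transaction log (not Delta Lake's code)."""

    def __init__(self):
        self._versions = [()]          # version 0 is an empty table
        self._lock = threading.Lock()  # makes each commit atomic

    def latest_version(self):
        return len(self._versions) - 1

    def snapshot(self, version=None):
        """Return the immutable rows of one committed version."""
        if version is None:
            version = self.latest_version()
        return self._versions[version]

    def try_commit(self, read_version, new_rows):
        """Append rows as the next version.

        Succeeds only if no other writer has committed since
        `read_version`. Returns the new version number, or None
        when the commit conflicts.
        """
        with self._lock:
            if read_version != self.latest_version():
                return None
            self._versions.append(self._versions[-1] + tuple(new_rows))
            return self.latest_version()


def append_with_retry(log, new_rows, max_attempts=5):
    """Re-read the latest version and retry until the commit lands."""
    for _ in range(max_attempts):
        version = log.try_commit(log.latest_version(), new_rows)
        if version is not None:
            return version
    raise RuntimeError("too many conflicting writers")


log = ToyCommitLog()
reader_view = log.snapshot()           # a reader pins version 0
v1 = append_with_retry(log, ["a"])     # first writer commits version 1
stale = log.try_commit(0, ["b"])       # writer still based on version 0 conflicts
v2 = append_with_retry(log, ["b"])     # the retry commits version 2
```

The reader's pinned snapshot stays unchanged while writers commit, and the conflicting write is rejected instead of being interleaved with the first one.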

Reading a specific version of a data set—what the Delta Lake documentation calls “time travel”—works by simply reading a DataFrame with an associated time stamp or version ID. Delta Lake also ensures the schema of the DataFrame being written matches the table it’s being written to; if there’s a mismatch, it throws an exception rather than change the schema. (Spark’s file APIs will replace the table in such a case.)
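Both behaviors can be sketched in Python. `read_delta_as_of` uses the reader options `versionAsOf` and `timestampAsOf` that the Delta Lake documentation describes for time travel. The function names, the `spark` argument, and the dictionary-based schema check are assumptions made for illustration. The schema check in particular is a simplified stand-in for the idea of enforcement, not Delta Lake's actual rules.

```python
def time_travel_options(version=None, timestamp=None):
    """Build reader options selecting one historical snapshot."""
    if (version is None) == (timestamp is None):
        raise ValueError("pass exactly one of version or timestamp")
    if version is not None:
        return {"versionAsOf": version}
    return {"timestampAsOf": timestamp}


def read_delta_as_of(spark, path, version=None, timestamp=None):
    """Read an older snapshot of the Delta table at `path`.

    `spark` is assumed to be a SparkSession with Delta Lake available.
    """
    options = time_travel_options(version, timestamp)
    return spark.read.format("delta").options(**options).load(path)


def schema_mismatches(table_schema, incoming_schema):
    """Toy check over {column: type} dicts.

    Returns the incoming columns that are unknown to the table or
    whose type differs from the table's.
    """
    return [
        name
        for name, dtype in incoming_schema.items()
        if table_schema.get(name) != dtype
    ]


def enforce_schema(table_schema, incoming_schema):
    """Raise instead of silently changing the table's schema."""
    bad = schema_mismatches(table_schema, incoming_schema)
    if bad:
        raise ValueError(f"schema mismatch on columns: {bad}")
```

For example, `read_delta_as_of(spark, "/tmp/events", version=0)` would ask for the table as first written (the path is a placeholder), while `enforce_schema({"id": "long"}, {"id": "string"})` raises rather than letting the write through.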

Future releases of Delta Lake may support more of Spark’s public API set, although DataFrameReader/Writer are the main focus for now.


by Serdar Yegulalp

Senior Writer


Serdar Yegulalp is a senior writer at InfoWorld. A veteran technology journalist, Serdar has been writing about computers, operating systems, databases, programming, and other information technology topics for 30 years. Before joining InfoWorld in 2013, Serdar wrote for Windows Magazine, InformationWeek, Byte, and a slew of other publications. At InfoWorld, Serdar has covered software development, devops, containerization, machine learning, and artificial intelligence, winning several B2B journalism awards including a 2024 Neal Award and a 2025 Azbee Award for best instructional content and best how-to article, respectively. He currently focuses on software development tools and technologies and major programming languages including Python, Rust, Go, Zig, and Wasm. Tune into his weekly Dev with Serdar videos for programming tips and techniques and close looks at programming libraries and tools.
