Serdar Yegulalp
Senior Writer

MIT-Stanford project uses LLVM to break big data bottlenecks

news
Mar 20, 20173 mins

Written in Rust, Weld can provide orders-of-magnitude speedups to Spark and TensorFlow

The more cores you can use, the better — especially with big data. But the easier a big data framework is to work with, the harder it is for the resulting pipelines, such as TensorFlow plus Apache Spark, to run in parallel as a single unit.

Researchers from MIT CSAIL, the home of envelope-pushing big data acceleration projects like Milk and Tapir, have paired with the Stanford InfoLab to create a possible solution. Written in the Rust language, Weld generates code for an entire data analysis workflow that runs efficiently in parallel using the LLVM compiler framework.

The group describes Weld as a “common runtime for data analytics” that takes the disjointed pieces of a modern data processing stack and optimizes them in concert. Each individual piece runs fast, but “data movement across the [different] functions can dominate the execution time.”

In other words, the pipeline spends more time moving data back and forth between pieces than actually doing work on it. Weld creates a runtime that each library can plug into, providing a common method to run key data across the pipeline that needs parallelization and optimization.

Frameworks don’t generate code for the runtime themselves. Instead, they call Weld via an API that describes what kind of work is being done. Weld then uses LLVM to generate code that automatically includes optimizations like multithreading or the Intel AV2 processor extensions for high-speed vector math.

InfoLab put together preliminary benchmarks comparing the native versions of Spark SQL, NumPy, TensorFlow, and the Python math-and-stats framework Pandas with their Weld-accelerated counterparts. The most dramatic speedups came with the NumPy-plus-Pandas benchmark, where the work could be amplified “by up to two orders of magnitude” when parallelized across 12 cores.

Those familiar with Pandas and want to take Weld for a spin can check out Grizzly, a custom implementation of Weld with Pandas.

It’s not the pipeline, it’s the pieces

Weld’s approach comes out of what its creators believe is a fundamental problem with the current state of big data processing frameworks. The individual pieces aren’t slow; most of the bottlenecks arise from having to hook them together in the first place.

Building a new pipeline integrated from the inside out isn’t the answer, either. People want to use existing libraries, like Spark and TensorFlow. Dumping that means getting rid of a culture of software already built around those products.

Instead, Weld proposes making changes to the internals of those libraries, so they can work with the Weld runtime. Application code that, say, uses Spark wouldn’t have to change at all. Thus, the burden of the work would fall on the people best suited to making those changes — the library and framework maintainers — and not on those constructing apps from those pieces.

Weld also shows that LLVM is a go-to technology for systems that generate code on demand for specific applications, instead of forcing developers to hand-roll custom optimizations. MIT’s previous project, Tapir, used a modified version of LLVM to automatically generate code that can run in parallel across multiple cores.

Another cutting-edge aspect to Weld: it was written in Rust, Mozilla’s language for fast, safe software development. Despite its relative youth, Rust has an active and growing community of professional developers frustrated with having to compromise safety for speed or vice versa. There’s been talk of rewriting existing applications in Rust, but it’s tough to fight the inertia. Greenfield efforts like Weld, with no existing dependencies, are likely to become the standard-bearers for the language as it matures.

Serdar Yegulalp

Serdar Yegulalp is a senior writer at InfoWorld. A veteran technology journalist, Serdar has been writing about computers, operating systems, databases, programming, and other information technology topics for 30 years. Before joining InfoWorld in 2013, Serdar wrote for Windows Magazine, InformationWeek, Byte, and a slew of other publications. At InfoWorld, Serdar has covered software development, devops, containerization, machine learning, and artificial intelligence, winning several B2B journalism awards including a 2024 Neal Award and a 2025 Azbee Award for best instructional content and best how-to article, respectively. He currently focuses on software development tools and technologies and major programming languages including Python, Rust, Go, Zig, and Wasm. Tune into his weekly Dev with Serdar videos for programming tips and techniques and close looks at programming libraries and tools.

More from this author