Serdar Yegulalp
Senior Writer

Yahoo schools Spark on deep learning

news analysis
Feb 25, 2016 | 2 mins

CaffeOnSpark works with the big data platform to create applications like speech or image recognition

Yahoo has built a deep learning system for developing predictive applications such as speech or image recognition. Open source projects like Google TensorFlow and Microsoft CNTK already offer such systems, but Yahoo stands apart by building on a major force in big data processing: Spark.

CaffeOnSpark, introduced yesterday in a blog post, builds on the Caffe deep-learning framework developed by the Berkeley Vision and Learning Center (Yahoo is one of its sponsors).

Spark features an array of machine learning algorithms, introducing new ones in each successive revision. But deep learning — training a neural net with a mass of data and using it to make decisions — isn’t part of its portfolio.

CaffeOnSpark addresses that by accepting data prepared by a Spark application and allowing the resulting predictions to be extracted by Spark via SQL query or its other machine learning libraries.


CaffeOnSpark melds Spark’s in-memory processing with the Caffe deep learning framework, allowing Spark users to train models on their data and derive insights from those models via a system they’re already comfortable with.

The Spark and Caffe nodes can sit side by side on the same hardware, meaning the data doesn’t have to be moved around as much and thus can be processed faster. Training jobs can also have their state periodically checkpointed, so a long-running job can be paused and resumed, or recovered in the event of a crash.
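To make the checkpointing idea concrete, here is a minimal sketch in plain Python. It is not CaffeOnSpark code: the `train` function, the JSON checkpoint file, and the `stop_after` parameter are all hypothetical stand-ins chosen for illustration. It only shows the general pattern of saving state periodically and resuming from the last saved state.

```python
import json
import os
import tempfile


def train(total_steps, checkpoint_path, checkpoint_every=100, stop_after=None):
    """Toy training loop that periodically saves its state and can resume from it."""
    # Resume from the last checkpoint if one exists; otherwise start fresh.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "weight": 0.0}

    while state["step"] < total_steps:
        # Stand-in for one real training iteration.
        state["weight"] += 0.01
        state["step"] += 1

        if state["step"] % checkpoint_every == 0:
            # Write to a temp file, then rename, so a crash mid-write
            # never leaves a truncated checkpoint behind.
            tmp_path = checkpoint_path + ".tmp"
            with open(tmp_path, "w") as f:
                json.dump(state, f)
            os.replace(tmp_path, checkpoint_path)

        if stop_after is not None and state["step"] >= stop_after:
            break  # simulate a pause or a crash

    return state


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
    train(1000, path, stop_after=250)   # interrupted partway through
    print(train(1000, path)["step"])    # picks up from the last checkpoint
```

In this sketch, a job interrupted at step 250 loses only the work done since the step-200 checkpoint, which is the trade-off behind any checkpoint interval: more frequent saves cost more I/O but lose less work after a failure.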

For the sake of familiarity, launching applications and running processing in CaffeOnSpark are done by way of the existing Spark command set. But CaffeOnSpark instances running on different nodes don’t communicate with each other through Spark. Instead, they use their own system, MPI, which can be routed over Ethernet or RDMA/Infiniband, to avoid bottlenecks.
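The article doesn’t spell out what the nodes exchange over that direct channel. Assuming it is an allreduce-style averaging of gradients, a pattern common in MPI-based training, the outcome of one synchronization step can be simulated in a single process as below. The `allreduce_mean` function is a hypothetical illustration, not part of CaffeOnSpark or any MPI library, and it models only the result of the exchange, not the network traffic.

```python
def allreduce_mean(worker_gradients):
    """Return what every worker holds after an allreduce-style averaging step."""
    num_workers = len(worker_gradients)
    length = len(worker_gradients[0])
    if any(len(g) != length for g in worker_gradients):
        raise ValueError("all workers must hold gradients of the same length")

    # Sum element-wise across workers, then divide by the worker count.
    mean = [sum(g[i] for g in worker_gradients) / num_workers for i in range(length)]

    # Every worker ends up with an identical copy of the averaged gradient.
    return [list(mean) for _ in range(num_workers)]


if __name__ == "__main__":
    # Two simulated workers, each with a locally computed gradient.
    print(allreduce_mean([[1.0, 2.0], [3.0, 6.0]]))
```

Because each worker finishes the step with the same averaged values, no central coordinator has to collect and redistribute results, which is one reason a direct node-to-node channel can avoid a bottleneck.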

The biggest advantage of CaffeOnSpark is its use of an existing big data processing tool that has already achieved a great deal of user and developer momentum. Google and Microsoft tout ease of use as a chief advantage of their solutions, but familiar tools always help the transition to a new workflow or data paradigm, especially given Spark’s reputation for accessibility and simplicity.


Serdar Yegulalp is a senior writer at InfoWorld. A veteran technology journalist, Serdar has been writing about computers, operating systems, databases, programming, and other information technology topics for 30 years. Before joining InfoWorld in 2013, Serdar wrote for Windows Magazine, InformationWeek, Byte, and a slew of other publications. At InfoWorld, Serdar has covered software development, devops, containerization, machine learning, and artificial intelligence, winning several B2B journalism awards including a 2024 Neal Award and a 2025 Azbee Award for best instructional content and best how-to article, respectively. He currently focuses on software development tools and technologies and major programming languages including Python, Rust, Go, Zig, and Wasm. Tune into his weekly Dev with Serdar videos for programming tips and techniques and close looks at programming libraries and tools.
