Serdar Yegulalp
Senior Writer

Pin this! Faster Hadoop queries … from Pinterest?

news analysis
Sep 15, 2015 | 2 mins

More than a repository for food and crafts, Pinterest has created an open source querying system that pulls job-generated data out of Hadoop and fixes back-end bottlenecks


Pinterest has rolled out an open source solution to make it easier to query large data sets when working with Hadoop.

Terrapin, announced at Facebook’s @Scale conference, was originally devised by Pinterest to replace the scalable Hadoop data store HBase. The idea was to provide a fast way to store and run key-value queries against large immutable data sets generated by MapReduce jobs or stored in S3 or HDFS volumes.

According to the blog post describing Terrapin, Pinterest ran into serious performance bottlenecks when writing hundreds of gigabytes of data to HBase. Bulk uploading solved the performance problem but caused the inserted data to be scattered across a cluster, which hurt performance all over again.


Terrapin ingests data, either already stored or generated by Hadoop jobs, and serves it from HFiles on top of HDFS. Requests for data can be served from multiple clusters at once to lower latency.

Terrapin uses HBase’s existing HFile format for data storage on top of HDFS, but provides its own custom system for figuring out where queried HFile data lives and serving it up from the needed node. 
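The key idea behind that custom layer is deterministic routing: given a key, any client can compute which shard (and thus which serving node) holds the data, without consulting the HFiles themselves. The following is a minimal sketch of that pattern, not Terrapin's actual code; the function names and the static shard-to-node table are illustrative assumptions (in a real deployment, the mapping would live in a coordination service such as ZooKeeper).

```python
import hashlib

def shard_for_key(key: str, num_shards: int) -> int:
    # Hash the key deterministically so every client picks the same shard.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Assumed, hard-coded mapping of shards to serving nodes for illustration.
shard_to_node = {0: "node-a", 1: "node-b", 2: "node-c"}

def node_for_key(key: str) -> str:
    # Route the query straight to the node that owns the key's shard.
    return shard_to_node[shard_for_key(key, len(shard_to_node))]
```

Because routing is a pure function of the key, there is no per-request lookup bottleneck, and a query can be fanned out to replicas of the same shard in multiple clusters to trim tail latency.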

Data fed into Terrapin can come from a variety of sources — for example, a MapReduce job that writes directly to Terrapin or HDFS/S3/Hive sources. HFiles can also be “live swapped,” with a newer data set replacing an older one on the fly.
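The "live swap" is essentially a version flip: readers always see one complete, immutable version of the data, and publishing a new version atomically replaces the old one. A hedged sketch of those semantics, using an in-memory dict as a stand-in for a set of HFiles (the class and method names here are invented for illustration, not Terrapin's API):

```python
import threading

class LiveSwapStore:
    """Serve an immutable dataset version; swap in a new version atomically."""

    def __init__(self, initial: dict):
        self._lock = threading.Lock()
        self._data = initial  # the current live version

    def get(self, key):
        # Reads always hit whichever version is currently live.
        return self._data.get(key)

    def swap(self, new_version: dict):
        # Replace the whole dataset in one step; in Terrapin this would be
        # a freshly written set of HFiles promoted to "live".
        with self._lock:
            self._data = new_version

store = LiveSwapStore({"pin:1": "cats"})
store.swap({"pin:1": "crafts", "pin:2": "food"})
```

Because each version is immutable, a swap never leaves readers with a half-updated view — they see the old dataset until the flip, then the new one.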

Much of the Hadoop infrastructure in use leverages MapReduce, but interest and pressure are mounting to make Spark the new centerpiece for Hadoop. The current integration between Terrapin and Spark consists largely of having Spark write HFiles, but Terrapin’s storage format system is extensible. In theory, any number of Hadoop data storage formats that can be queried via keys (Parquet, for instance) could get their own connectors in time.
