Serdar Yegulalp
Senior Writer

Pin this! Faster Hadoop queries … from Pinterest?

news analysis
Sep 15, 2015 | 2 mins

More than a repository for food and crafts, Pinterest has created an open source querying system that pulls job-generated data out of Hadoop and fixes back-end bottlenecks


Pinterest has rolled out an open source solution to make it easier to query large data sets when working with Hadoop.

Terrapin, announced at Facebook’s @Scale conference, was originally devised by Pinterest to replace the scalable Hadoop data store HBase. The idea was to provide a fast way to store and run key-value queries against large immutable data sets generated by MapReduce jobs or stored in S3 or HDFS volumes.

According to the blog post describing Terrapin, Pinterest ran into serious performance bottlenecks when writing hundreds of gigabytes of data to HBase. Bulk uploading solved the performance problem but caused the inserted data to be scattered across a cluster, which hurt performance all over again.


Terrapin ingests data, either already stored or generated by Hadoop jobs, and serves it from HFiles on top of HDFS. Requests for data can be served from multiple clusters at once to lower latency.

Terrapin uses HBase’s existing HFile format for data storage on top of HDFS, but provides its own custom system for figuring out where queried HFile data lives and serving it up from the needed node. 
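The key idea behind that custom layer is deterministic routing: given a key, any client can compute which shard (and thus which serving node) holds the data, without consulting the HFiles themselves. The following is a minimal sketch of that pattern, not Terrapin's actual code; the function names and the static shard-to-node table are illustrative assumptions (in a real deployment, the mapping would live in a coordination service such as ZooKeeper).

```python
import hashlib

def shard_for_key(key: str, num_shards: int) -> int:
    # Hash the key deterministically so every client picks the same shard.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Assumed, hard-coded mapping of shards to serving nodes for illustration.
shard_to_node = {0: "node-a", 1: "node-b", 2: "node-c"}

def node_for_key(key: str) -> str:
    # Route the query straight to the node that owns the key's shard.
    return shard_to_node[shard_for_key(key, len(shard_to_node))]
```

Because routing is a pure function of the key, there is no per-request lookup bottleneck, and a query can be fanned out to replicas of the same shard in multiple clusters to trim tail latency.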

Data fed into Terrapin can come from a variety of sources — for example, a MapReduce job that writes directly to Terrapin or HDFS/S3/Hive sources. HFiles can also be “live swapped,” with a newer data set replacing an older one on the fly.
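The "live swap" is essentially a version flip: readers always see one complete, immutable version of the data, and publishing a new version atomically replaces the old one. A hedged sketch of those semantics, using an in-memory dict as a stand-in for a set of HFiles (the class and method names here are invented for illustration, not Terrapin's API):

```python
import threading

class LiveSwapStore:
    """Serve an immutable dataset version; swap in a new version atomically."""

    def __init__(self, initial: dict):
        self._lock = threading.Lock()
        self._data = initial  # the current live version

    def get(self, key):
        # Reads always hit whichever version is currently live.
        return self._data.get(key)

    def swap(self, new_version: dict):
        # Replace the whole dataset in one step; in Terrapin this would be
        # a freshly written set of HFiles promoted to "live".
        with self._lock:
            self._data = new_version

store = LiveSwapStore({"pin:1": "cats"})
store.swap({"pin:1": "crafts", "pin:2": "food"})
```

Because each version is immutable, a swap never leaves readers with a half-updated view — they see the old dataset until the flip, then the new one.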

Much of the Hadoop infrastructure in use leverages MapReduce, but interest and pressure are mounting to make Spark the new centerpiece for Hadoop. The current integration between Terrapin and Spark consists largely of having Spark write HFiles, but Terrapin’s storage format system is extensible. In theory, any number of Hadoop data storage formats that can be queried via keys (Parquet, for instance) could get their own connectors in time.
