Serdar Yegulalp
Senior Writer

Python and Hadoop project puts data scientists first

news analysis
Jul 20, 2015
2 mins

Still under wraps, Cloudera's Ibis combines Python and Hadoop with data scientists in mind

Scientists and mathematicians have long loved Python as a vehicle for working with data and automation. Python has no shortage of libraries for working with Hadoop, such as Hadoopy and Pydoop, but those libraries are designed more for Hadoop users than for data scientists proper.

Cloudera’s new project, Ibis, is an open source (Apache licensed) data analysis framework meant to span the gap. It provides “comprehensive support for the built-in analytic capabilities in Impala for simplified ETL, data wrangling, and data analysis,” as Cloudera puts it.

To that end, Ibis seems as much about providing data-science Pythonistas with an automated avenue into Cloudera’s Impala framework (a SQL-querying system for Hadoop) as it is about working with Hadoop. (Cloudera engineers Wes McKinney and Marcel Kornacker describe Ibis as “providing a high level Python front-end for Hadoop rather than providing low-level access to a computation model like MapReduce or Spark.”)
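To make the "high-level front end" idea concrete, here is a toy sketch of how chained Python calls might be turned into SQL for an engine such as Impala. It is not the Ibis API: the `TableExpr` class, its methods, and the `web_logs` table are all hypothetical names invented for illustration, and nothing here connects to a real cluster.

```python
# Toy illustration only. This is NOT the Ibis API; every name is hypothetical.
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TableExpr:
    """A lazily built query. Nothing executes; we only assemble SQL text."""
    name: str
    predicates: tuple = ()
    group_col: str = ""
    metric: str = ""

    def filter(self, predicate: str) -> "TableExpr":
        # Return a new expression with one more WHERE condition.
        return replace(self, predicates=self.predicates + (predicate,))

    def group_by(self, column: str) -> "TableExpr":
        return replace(self, group_col=column)

    def count(self) -> "TableExpr":
        return replace(self, metric="COUNT(*) AS n")

    def to_sql(self) -> str:
        # Select the grouping column and metric if present, otherwise everything.
        cols = [c for c in (self.group_col, self.metric) if c]
        sql = f"SELECT {', '.join(cols) or '*'} FROM {self.name}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(self.predicates)
        if self.group_col:
            sql += f" GROUP BY {self.group_col}"
        return sql


# The analyst writes Python; a SQL engine would receive the generated string.
expr = TableExpr("web_logs").filter("status = 500").group_by("host").count()
sql = expr.to_sql()
print(sql)
```

The point of the sketch is the division of labor: the analyst composes expressions in Python, and the heavy lifting is handed to the SQL engine rather than to hand-written MapReduce or Spark jobs.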

That said, it’s not hard to see how using Python to work with Impala would allow for new kinds of data-exploration automation. Cloudera CEO Mike Olson described Impala’s utility as a two-way street: “You can run queries [with Impala] that create results that you then MapReduce. You can use MapReduce to analyze data that you then query [with Impala].”

For now, Ibis is offered only as a preview, and Cloudera has hinted at the project’s eventual evolution. “Upcoming versions,” stated the project’s press release, “will allow users to leverage the full range of Python packages as well as author their own Python functions.”

Hadoop has been a Java-centric enterprise since the beginning, meaning anyone with a Python-centric workflow has been forced to deal with the framework at arm’s length or greater. What’ll be key is whether Ibis can in time provide a general soup-to-nuts framework for using Python with Hadoop, both inside and outside of Impala.

Serdar Yegulalp is a senior writer at InfoWorld. A veteran technology journalist, Serdar has been writing about computers, operating systems, databases, programming, and other information technology topics for 30 years. Before joining InfoWorld in 2013, Serdar wrote for Windows Magazine, InformationWeek, Byte, and a slew of other publications. At InfoWorld, Serdar has covered software development, devops, containerization, machine learning, and artificial intelligence, winning several B2B journalism awards including a 2024 Neal Award and a 2025 Azbee Award for best instructional content and best how-to article, respectively. He currently focuses on software development tools and technologies and major programming languages including Python, Rust, Go, Zig, and Wasm. Tune into his weekly Dev with Serdar videos for programming tips and techniques and close looks at programming libraries and tools.
