
The siren song of Hadoop

opinion
May 23, 2017 | 3 mins

Hadoop provides power and versatility for data scientists, at the cost of complexity.


Hadoop seems incredibly well-suited to shouldering machine-learning workloads. With HDFS you can store both structured and unstructured data across a cluster of machines, and SQL-on-Hadoop technologies like Hive make that structured data look like database tables. Execution frameworks like Spark let you distribute compute across the cluster as well. On paper, Hadoop is the perfect environment for running compute-intensive distributed machine learning algorithms across vast amounts of data.

Unfortunately, though, Hadoop seems incredibly well-suited for a lot of other things too. Streaming data? Storm and Flink! Security? Kerberos, Sentry, Ranger, and Knox! Data movement and message queues? Flume, Sqoop, and Kafka! SQL? Hive, Impala, and Hawq! The Hadoop ecosystem has become a bag of often overlapping and competing technologies. The Cloudera vs. Hortonworks vs. MapR rivalry is responsible for some of this, as is the dynamism of the open source community.

To a technology enthusiast, this is actually quite exciting. From an implementation perspective, it's a nightmare.

Innovation or insanity?

I’ve seen the pain play out at several organizations. MapReduce code becomes obsolete as the organization moves toward Spark. IT is using Hive to control data access, but you can’t easily run Spark jobs against Hive tables. Kerberos makes everything confusing and difficult. To quote the most popular technical guide to Kerberos: “Just as the infamous Necronomicon is a collection of notes scrawled in blood as a warning to others, this book is: (1) Incomplete. (2) Based on experience and superstition, rather than understanding and insight. (3) Contains information that will drive the reader insane.”

At this point the cloud starts to look pretty good. Why suffer all these infrastructure headaches when Amazon has already figured everything out for you with its Hadoop-as-a-service offering, Elastic MapReduce (EMR)? Well, you’re about to get caught between EMR’s cost structure and Hadoop’s history as a platform for aggregating vast amounts of consumer-grade storage and compute hardware. HDFS assumes you have access to lots of cheap but fallible disks and helpfully replicates your data 3x by default. EMR then helpfully charges you for all that storage. The entire Hadoop ecosystem has been architected without storage thrift in mind, yet storage will drive the majority of your cloud bill.
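The arithmetic behind that complaint is easy to sketch. Under HDFS’s default replication factor of 3, every logical gigabyte occupies three gigabytes of raw disk, and cloud billing is on the raw bytes. The per-GB price below is a hypothetical placeholder, not an actual EMR or EBS rate:

```python
def monthly_storage_cost(logical_gb, price_per_gb_month, replication=3):
    """Estimate monthly storage cost for data stored in HDFS.

    Raw bytes on disk = logical data size x HDFS replication factor
    (3 by default), and the cloud bills for the raw bytes.
    """
    raw_gb = logical_gb * replication
    return raw_gb * price_per_gb_month

# 10 TB of logical data at a hypothetical $0.05 per GB-month:
# billed as 30 TB of raw storage.
print(monthly_storage_cost(10_000, 0.05))  # 1500.0
```

In other words, a cluster holding 10 TB of data pays for 30 TB of disk before a single query runs, which is why the replication default matters so much to the cloud bill.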

So as someone who’s trying to get real data science work done with Hadoop, you’re fighting several battles — a cacophony of conflicting technologies, multiple ways to accomplish the same goal, natural disconnects between IT and users, a gnarly cost structure in the cloud, and a constantly shifting technology landscape that obsoletes past work.

The solution is to put at least one layer of abstraction between your data science users and the raw Hadoop layer. With platforms that provide this abstraction, data scientists define the work they’d like to do (say, a null value replacement operation followed by building a logistic regression model), while the platform itself identifies the right set of Hadoop technologies to accomplish those goals (say, choosing among the MapReduce, Pig, and Spark execution frameworks). If a new execution framework enters the Hadoop ecosystem, one need only update the data science platform, not thousands of lines of code.
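The shape of such an abstraction layer can be sketched in a few lines. Everything here is hypothetical (the step names, the engine table, the `plan` function); the point is only that the user’s pipeline names intent, not Hadoop technologies, so the platform can remap intent to engines without touching user code:

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One declarative unit of work; no execution framework is named."""
    name: str
    params: dict = field(default_factory=dict)


# The user's pipeline is pure intent: clean nulls, then fit a model.
pipeline = [
    Step("replace_nulls", {"strategy": "mean"}),
    Step("logistic_regression", {"target": "churned"}),
]

# The platform's mapping from intent to engine. Migrating off MapReduce
# to Spark (or a future framework) changes this table, not pipelines.
ENGINE_FOR_STEP = {
    "replace_nulls": "spark",
    "logistic_regression": "spark",
}


def plan(pipeline):
    """Resolve each declarative step to a concrete execution engine."""
    return [(step.name, ENGINE_FOR_STEP[step.name]) for step in pipeline]


print(plan(pipeline))  # [('replace_nulls', 'spark'), ('logistic_regression', 'spark')]
```

When the ecosystem shifts again, only the `ENGINE_FOR_STEP` table is updated; the thousands of user pipelines defined against `Step` remain valid.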

Despite the chaos, Hadoop has tremendous potential to tackle modern machine learning workflows. Just don’t let it drive you insane in the process.

Josh Lewis is the VP, Product at Alpine Data. Josh has ten years of experience across academia and industry in machine learning, data analysis, cognitive science, and user experience. Prior to joining Alpine, Josh led the frontend engineering team at Ayasdi, where he built apps and APIs for the healthcare, pharmaceutical, and finance verticals, as well as Ayasdi’s domain-general data analysis and visualization software.

Before joining Ayasdi, Josh was a PhD student and postdoc in the UC San Diego Cognitive Science Department, where he investigated the role of human perception and insight in the data analysis process. He also developed Divvy, novel software for applying unsupervised machine learning algorithms, a project supported by a multi-year NSF grant.

Josh graduated from Pomona College with majors in Cognitive Science and Philosophy.

The opinions expressed in this blog are those of Josh Lewis and do not necessarily represent those of IDG Communications, Inc., its parent, subsidiary or affiliated companies.
