by Yves de Montcheuil

The Twelve Days of Christmas (a Data Carol)

opinion
Dec 22, 2014 | 5 mins

On the First day of Christmas, big data gave to me / One Hadoop ecosystem... but many components! Please sing along.

On the First day of Christmas, big data gave to me One Hadoop ecosystem

Do I really need to explain Hadoop? I didn’t think so. For the ecosystem part, let’s sing on…

***

On the Second day of Christmas, big data gave to me One Hadoop ecosystem Two data infrastructures

Here we start. Hadoop or the RDBMS? Or is it Hadoop and the RDBMS? The two data storage infrastructures serve distinct purposes. Better to make them work together than to oppose them.

***

On the Third day of Christmas, big data gave to me One Hadoop ecosystem Two data infrastructures Three processing frameworks

Hadoop’s first incarnation included MapReduce, which is still prevalent for batch processing but doesn’t address interactive uses. The advent of YARN has enabled competing frameworks such as Tez and Spark to provide the real-time answer.

***

On the Fourth day of Christmas, big data gave to me One Hadoop ecosystem Two data infrastructures Three processing frameworks Four data lakes

Data lakes are all the rage nowadays: just pour data into the lake, and fish for insight. But the data lake architecture does not address critical challenges of data governance, and as such cannot be put in everyone’s hands. (Four is only for the riddle here.)

***

On the Fifth day of Christmas, big data gave to me One Hadoop ecosystem Two data infrastructures Three processing frameworks Four data lakes Five SQL-on-Hadoop

Built on top of the interactive/real-time frameworks, SQL-on-Hadoop layers started with Hive and now include Stinger, Impala, Hawq, Drill, just to name the main ones. Today there is no clear winner. For Hadoop vendors, this is one of the differentiators between their stack and a plain-vanilla Hadoop implementation.

***

On the Sixth day of Christmas, big data gave to me One Hadoop ecosystem Two data infrastructures Three processing frameworks Four data lakes Five SQL-on-Hadoop Six BI tools

SQL-on-Hadoop enables any “traditional” BI tool to run reports and queries against Hadoop. There are also a number of BI technologies that specifically target Hadoop and run natively inside it, taking advantage of the processing power and flexibility brought by YARN.

***

On the Seventh day of Christmas, big data gave to me One Hadoop ecosystem Two data infrastructures Three processing frameworks Four data lakes Five SQL-on-Hadoop Six BI tools Seven data wrangling tools

Data wrangling, or data preparation, is the latest rage in the big data ecosystem. It aims to provide non-data-scientists with rich navigation and discovery inside their data, no matter how convoluted and unclean it is. Paxata, Trifacta, and Springbok represent only the first wave of tools, whose count will likely exceed seven very soon.

***

On the Eighth day of Christmas, big data gave to me One Hadoop ecosystem Two data infrastructures Three processing frameworks Four data lakes Five SQL-on-Hadoop Six BI tools Seven data wrangling tools Eight data scientists

You should actually consider yourself lucky if you can hire eight of them, let alone pay their exorbitant salaries. Data scientist has become the most sought-after profile in both born-digital companies and traditional businesses. It is jokingly said that a data scientist is a business analyst who lives in California, but this is not the full story. The data scientist masters the technology needed to explore and mine data, and possesses the business acumen to identify data-driven business opportunities.

***

On the Ninth day of Christmas, big data gave to me One Hadoop ecosystem Two data infrastructures Three processing frameworks Four data lakes Five SQL-on-Hadoop Six BI tools Seven data wrangling tools Eight data scientists Nine Hadoop vendors

Are there really nine Hadoop vendors? In its Wave report (The Forrester Wave: Big Data Hadoop Solutions, Q1 2014), Forrester lists Cloudera, Hortonworks, MapR, Pivotal, IBM, Amazon, Microsoft, Intel, and Teradata. And that’s without counting Apache. Since that report was published, Intel has invested in Cloudera and announced that it would retire its own distribution. Microsoft and Teradata resell others’ distributions. Oracle (also a reseller) seems to be missing from the list. A real moving target…

***

On the Tenth day of Christmas, big data gave to me One Hadoop ecosystem Two data infrastructures Three processing frameworks Four data lakes Five SQL-on-Hadoop Six BI tools Seven data wrangling tools Eight data scientists Nine Hadoop vendors Ten data-driven applications

Ten is for the riddle only. There are as many data-driven applications as there are data projects. Once data science has worked its magic, once data algorithms have been operationalized, monetization starts. The value of data is only realized when it is fed to applications, mobile apps, web sites…

***

On the Eleventh day of Christmas, big data gave to me One Hadoop ecosystem Two data infrastructures Three processing frameworks Four data lakes Five SQL-on-Hadoop Six BI tools Seven data wrangling tools Eight data scientists Nine Hadoop vendors Ten data-driven applications Eleven NoSQL engines

As with Hadoop distributions, counting NoSQL engines is difficult. However, Gartner identified eleven commercial vendors in a “Who’s Who in NoSQL DBMSs” report published last year: MongoDB, Cloudant, Couchbase, MarkLogic, Neo, Objectivity, Aerospike, Basho, Oracle, Redis, and DataStax.

***

On the Twelfth day of Christmas, big data gave to me One Hadoop ecosystem Two data infrastructures Three processing frameworks Four data lakes Five SQL-on-Hadoop Six BI tools Seven data wrangling tools Eight data scientists Nine Hadoop vendors Ten data-driven applications Eleven NoSQL engines Twelve data-driven APIs

Feeding the data-driven apps and providing the glue with the back-end systems, RESTful APIs are the unseen interfaces that ensure reliable, secure and agile connectivity to the data. They also open up monetization capabilities, enable open data, and support many more uses of big data.

Yves de Montcheuil is a recognized authority on digital business trends and information management. A marketing executive with a track record at several successful IT vendors, he is also a strategic advisor for digital companies and runs the International Commission of Tech In France. Yves is a strong presenter, author, blogger, and social media enthusiast.

You can follow Yves on Twitter: @ydemontcheuil, or contact him via LinkedIn.

The opinions expressed in this blog are those of Yves de Montcheuil and do not necessarily represent those of IDG Communications, Inc., its parent, subsidiary or affiliated companies.
