Visual ETL considered harmful (to your skills)

Opinion | Aug 3, 2015 | 5 mins

Why hasn't the ETL process evolved with the times?

ETL (Extract, Transform, and Load) is the common name for the processes by which data gets pulled from multiple repositories, combined, filtered, and aggregated in various ways, and then "loaded" where it is intended to be used for a specific purpose, such as a target data mart. Many ETL tools are predicated on the bedrock belief that visual representations of the transformations are easier to create, manage, and understand than corresponding textual descriptions.

The idea goes that if users can simply drag and drop blocks representing data onto a visual design surface and connect the blocks via arrows representing transformations, then anyone can do it. Such visual tools have existed for more than 20 years (probably longer), yet the ETL process has not gotten radically more efficient or transparent. Why is that?

Much of it comes down to the fact that big diagrams of boxes and arrows are hard to understand, maintain, search, and reason about. Text is generally much better for those things. 

In principle, a visual representation of an ETL pipeline should be the easiest form to reason about. For example, you might trace back from a single box to all the other boxes that pour data into it. In practice, real ETL pipelines fan out very quickly, and you immediately find yourself considering very large numbers of tables and columns in any such walk. You quickly need software just to make sense of it, and I think the right HCI metaphors to really help with this have not yet been invented.

What’s more, commercial incentives generally push vendors to lock the user into a specific UI or visual language. By expressing the underlying computation in a proprietary format that requires continued use of the UI, the vendor guarantees hard migration and continued revenue. Once a computation is embedded in one of these proprietary UIs, you really need a specialist in that UI to go in, understand it, and modify it. Furthermore, it’s very hard to tell by simple inspection whether the modifications are correct. It may not feel that way in the beginning, when there are only a handful of boxes and arrows, but in enterprise environments the data flow usually quickly grows to dozens of boxes.

Are there other options? The traditional approach of writing straight-line code in low-level imperative languages like Java or C++ has certainly been tried. So has gluing together lots of different utilities with complex scripts in Bash or Python. In fact, ETL is so necessary and ubiquitous in data processing that you’re likely to find every possible approach in use. But involving software development at every step is generally the slowest and most expensive way to go. Whenever there’s a specify-build-test-try loop involving specialists and scheduling, things bog down.

If you’ve read my previous blog, you know I’m a big fan of SQL, despite its age and its many flaws. SQL is English-like and most people can read and make sense of a statement like:

SELECT customer, COUNT(items) AS num_items
FROM PurchasesTable
WHERE date >= DATE '2014-01-01'
GROUP BY customer
HAVING COUNT(items) > 10

ETL pipelines written in SQL are easier to read than complex diagrams that require software even to render on a screen. Being able to read and interpret a pipeline directly is often essential to understanding why data is coming out of it in a certain way. Otherwise, the pipeline is just a black box that can only be interpreted by its owners rather than being open and understandable. SQL is also often easier to write than a custom design surface tool, especially once the concepts become more complex and interrelated. That’s the concept of literacy I was referring to in my previous post. The most common case in large environments is that one starts from a library of existing statements that need to be modified, rather than from a completely blank sheet of paper. And once written, the results of one step are much more easily composed with the next step in a computation.
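The composition point above can be sketched concretely. Here is a minimal, hypothetical example using SQLite from Python: step one aggregates purchases per customer, and step two filters that result, with the two steps composed via a common table expression. The table name, columns, and data are illustrative only.

```python
import sqlite3

# Hypothetical in-memory data standing in for PurchasesTable;
# names and values are illustrative, not from any real system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PurchasesTable (customer TEXT, items TEXT, date TEXT)")
conn.executemany(
    "INSERT INTO PurchasesTable VALUES (?, ?, ?)",
    [("alice", f"item{i}", "2014-06-01") for i in range(12)]
    + [("bob", "item0", "2014-06-01")]
    + [("carol", "item0", "2013-12-31")],
)

# Step 1 (per-customer counts) composes directly into
# step 2 (filtering) via a common table expression.
query = """
WITH counts AS (
    SELECT customer, COUNT(items) AS num_items
    FROM PurchasesTable
    WHERE date >= '2014-01-01'
    GROUP BY customer
)
SELECT customer, num_items FROM counts WHERE num_items > 10
"""
rows = conn.execute(query).fetchall()
print(rows)  # -> [('alice', 12)]
```

Each intermediate result is itself a readable, named relation, so the next step can build on it without any proprietary design surface in the loop.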

Enterprises are under pressure to generate results faster, and many more people are working with data directly than ever before. Of course, dedicated UIs have helped tremendously in areas like visual analysis of data, for example with Tableau. But the greedy approach of following whatever seems easiest often leads to long-term maintainability issues, lack of agility, and higher total cost of ownership. ETL especially is an area where data quality, data governance, and traceability of computation are of paramount importance, and being able to read SQL is a kind of literacy here. Enterprises that invest in this type of literacy create long-term empowerment in their people and in their own ability to understand data across disparate systems.

I’d love to get your feedback on these thoughts. Please feel free to email me directly or post comments below!

In the 1960s, Edsger Dijkstra wrote the heartfelt “Go To Statement Considered Harmful” that inspired many other “XYZ Considered Harmful.” The title was intended to challenge orthodox views on a topic. This is a lighthearted series of posts in that vein.

Prior to founding Metanautix, Theo spent nearly eight years at Google, most recently as a Principal Engineer and Engineering Director of a 75-engineer team in data warehousing, visualization, and analysis. He led the development of Dremel, a large-scale, interactive ad hoc query engine for big data processing that powers Google’s BigQuery, as well as Tenzing, a SQL implementation on MapReduce.

Theo also worked on developing large-scale machine learning systems for personalized search ads, audience analysis systems for display ads, and early prototypes of page preview and results-as-you-type for search. Before Google, Theo was a software engineer at Microsoft and Microsoft Research, building data cleaning features for SQL Server and speech recognition models for Windows and Office. He holds a PhD from Brown University and a BS from Stanford University, both in Mathematics.

The opinions expressed in this blog are those of Theo Vassilakis and do not necessarily represent those of IDG Communications, Inc., its parent, subsidiary or affiliated companies.
