Contributor

ETL is dead

opinion
Oct 13, 20175 mins

ETL is hard. And it's limiting. Unfortunately, there wasn't a better way because of the constraints of data technology. Until now

5 end of life grave
Credit: Thinkstock

Extract, transform, and load. It doesn’t sound too complicated. But, as anyone who’s managed a data pipeline will tell you, the simple name hides a ton of complexity.

And while none of the steps are easy, the part that gives data engineers nightmares is the transform. Taking raw data, cleaning it, filtering it, reshaping it, summarizing it, and rolling it up so that it’s ready for analysis. That’s where most of your time and energy goes, and it’s where there’s the most room for mistakes.

If ETL is so hard, why do we do it this way?

The answer, in short, is because there was no other option. Data warehouses couldn’t handle the raw data as it was extracted from source systems, in all its complexity and size. So the transform step was necessary before you could load and eventually query data. The cost, however, was steep.

Rather than maintaining raw data that could be transformed into any possible end product, the transform shaped your data into an intermediate form that was less flexible. You lost some of the data’s resolution, imposed the current version of your business’ metrics on the data, and threw out useless data.

And if any of that changed—if you needed hourly data when previously you’d only processed daily data, if your metric definitions changed, or some of that “useless” data turned out to not be so useless after all—then you’d have to fix your transformation logic, reprocess your data, and reload it.

The fix might take days or weeks

It wasn’t a great system, but it’s what we had.

So as technologies change and prior constraints fall away, it’s worth asking what we would do in an ideal world—one where data warehouses were infinitely fast and could handle data of any shape or size. In that world, there’d be no reason to transform data before loading it. You’d extract it and load it in its rawest form.

You’d still want to transform the data, because querying low-quality, dirty data isn’t likely to yield much business value. But your infinitely fast data warehouse could handle that transformation right at query time. The transformation and query would all be a single step. Think of it as just-in-time transformation. Or ELT.

The advantage of this imaginary system is clear: You wouldn’t have to decide ahead of time which data to discard or which version of your metric definitions to use. You’d always use the freshest version of your transformation logic, giving you total flexibility and agility.

So, is that the world we live in? And if so, should we switch to ELT?

Not quite. Data warehouses have indeed gotten several orders of magnitude faster and cheaper. Transformations that used to take hours and cost thousands of dollars now take seconds and cost pennies. But they can still get bogged down with misshapen data or huge processes.

So there’s still some transformation that’s best accomplished outside the warehouse. Removing irrelevant or dirty data, and doing heavyweight reshaping, is still often a preloading process. But this initial transform is a much smaller step and thus much less likely to need updating down the road.

Basically, it’s gone from a big, all-encompassing ‘T’ to a much smaller ‘t’

Once the initial transform is done, it’d be nice to move the rest of the transform to query time. But especially with larger data volumes, the data warehouses still aren’t quite fast enough to make that workable. (Plus, you still need a good way to manage the business logic and impose it as people query.)

So instead of moving all of that transformation to query time, more and more companies are doing most of it in the data warehouse—but they’re doing it immediately after loading. This gives them lots more agility than in the old system, but maintains tolerable performance. For now, at least, this is where the biggest “T” is happening.

The lightest-weight transformations—the ones the warehouses can do very quickly—are happening right at query time. This represents another small “t,” but it has a very different focus than the preloading “t.” That’s because these lightweight transformations often involve prototypes of new metrics and more ad hoc exploration, so the total flexibility that query-time transformation provides is ideal.

In short, we’re seeing a huge shift that takes advantage of new technologies to make analytics more flexible, more responsive, and more performant. As a result, employees are making better decisions using data that was previously slow, inaccessible, or worst of all, wrong. And the companies that embrace this shift are outpacing rivals stuck in the old way of doing things.

ETL? ETL is dead. But long live … um … EtLTt?

Daniel Mintz is the chief data evangelist at Looker Data Sciences. Previously, he was head of data and analytics at fast-growing media startup Upworthy, and before that he was director of analytics at MoveOn.org. Throughout his career, he has focused on how people interact with data in their everyday lives and how they can use it to get better at what they do.

Daniel has appeared in numerous media venues, including The New York Times, The Washington Post, NPR's All Things Considered and MSNBC's The Last Word. He has served on panels at the eMetrics Summit, the Big Data Innovation Summit, Netroots Nation, Rootscamp and Yale Law School, and he frequently leads statistics and analytics training programs.

Daniel holds a bachelor's degree in music and political science from Rutgers University and a master's degree in multimedia engineering from UC Santa Barbara's Media Arts and Technology program. An avid photographer and bread baker, he lives in Brooklyn with his wife, his daughter and his giant white dog.

The opinions expressed in this blog are those of Daniel Mintz and do not necessarily represent those of IDG Communications Inc. or its parent, subsidiary or affiliated companies.

More from this author