
ETL v ELT


This is part of a series on my preparation for the DP-900 exam. That’s the Microsoft Azure Data Fundamentals certification, one of a number of certification paths. You can read the various posts I’ve created as part of this learning experience.

I don’t have an ELT tag, and I’m not likely to make one. I tend to think of ETL as loading data somewhere, even though I know it means more.

The important concept for DP-900 here is that ELT is becoming more important, and you need to understand what it means. I’ll cover the concepts, but also touch on where the different Azure services fit in.

ETL

For most of my career, the pattern for loading data was Extract-Transform-Load. In this pattern we:

  • grab data from a source
  • make changes to clean/change/etc.
  • write to a target (or sink)

It’s how tools like SSIS work. They connect a source to a target and have a bunch of tasks or transforms in the middle that change the data in some way.
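
To make the three steps concrete, here’s a minimal sketch in Python rather than SSIS. The DSNs, table names, and the email-masking rule are all made up for illustration, not taken from any real system.

```python
import pyodbc

# Extract: grab data from the source system (hypothetical DSN)
src = pyodbc.connect("DSN=SourceSales")
rows = src.cursor().execute(
    "SELECT CustomerID, Email, OrderTotal FROM dbo.Orders"
).fetchall()

# Transform: clean/scrub the data before it ever reaches the target,
# e.g. mask email addresses for privacy
cleaned = [(cust_id, "***masked***", total) for cust_id, email, total in rows]

# Load: write the cleaned rows to the target (or sink), e.g. a data warehouse
tgt = pyodbc.connect("DSN=Warehouse")  # hypothetical DSN
cur = tgt.cursor()
cur.executemany(
    "INSERT INTO dbo.FactOrders (CustomerID, Email, OrderTotal) VALUES (?, ?, ?)",
    cleaned,
)
tgt.commit()
```

The point is the order: the data gets reshaped in the middle, so the target only ever sees the cleaned version.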

This is a good pattern for getting the work done when the target system is just built for querying data, such as a data warehouse. It is also good when you need to scrub some data, perhaps for privacy reasons.

This isn’t a good pattern when you are trying to load data quickly, as the transform process takes time.

ELT

This is the new way of doing things. In this pattern we Extract-Load-Transform, though really the name is a little odd, in that the process of moving the data just moves it.

Here we:

  • grab data from a source
  • write it to a target

Where’s the transform? Well, that happens on the target, often when someone queries the data. Modern analytic systems, like Snowflake and Synapse, can work with vast quantities of data, often stored in a data lake or blob storage, and consume it with powerful compute. There could be some minor re-shaping of the data on write, but that’s not the idea.
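
As a rough sketch of what the load half looks like, again in Python with made-up names (a hypothetical storage connection string and container): the raw files are copied into blob/lake storage untouched, and any transformation waits until query time.

```python
import glob
import os
from azure.storage.blob import BlobServiceClient

# Extract + Load: copy the raw source files into the lake as-is.
# No cleansing or reshaping here; that is deferred to query time on the target.
service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.get_container_client("raw-sales")

for path in glob.glob("exports/orders_*.csv"):
    with open(path, "rb") as f:
        # The blob name mirrors the source file name; the file lands unchanged.
        container.upload_blob(name=os.path.basename(path), data=f, overwrite=True)
```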

This is good when you might not read all the data. Why process (transform) what isn’t being read? Before you complain that you should know what is used, none of us knows whether all our data is being used. Unless we write crappy SELECT * code with no WHERE clauses.

This is also good when we need to work at speed and privacy isn’t a concern. It’s great for known file formats sent to us, as the target system can project a table on top of a set of files.
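
For example, a Synapse serverless SQL pool can project a table over a folder of files at query time with OPENROWSET. This is only a hedged sketch: the storage URL, column names, and connection string are invented.

```python
import pyodbc

# Transform-on-read: the engine projects a table over the raw CSV files
# in the lake at query time, rather than reshaping them on load.
conn = pyodbc.connect("<synapse-serverless-connection-string>")

query = """
SELECT CustomerID, SUM(OrderTotal) AS TotalSpend
FROM OPENROWSET(
    BULK 'https://mylake.dfs.core.windows.net/raw-sales/orders_*.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS orders
GROUP BY CustomerID;
"""

for row in conn.cursor().execute(query):
    print(row.CustomerID, row.TotalSpend)
```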

ELT seems to be the direction many analytical and warehouse systems are heading.
