The diverse world of data management and engineering ultimately serves one simple goal: turning raw data into actionable information. This requires:
- Moving data from one place to another: for example, from an operational system to a data warehouse.
- Transforming that data to answer questions: for example, to find the demographics of your customer base or measure your business’s KPIs.
If you’ve set out to design such as system, you’ve likely heard of two paradigms called ETL and ELT. It’s easy to get the impression that they’re rigidly defined and mutually exclusive.
Let’s look closely at what each term means, the underlying reality, and how we can use this to understand data infrastructure better.
ELT vs ETL: the simplified view
ETL and ELT comprise the same three steps of data integration. They are:
- Extract. Data is extracted, usually from multiple sources — APIs, applications, webhooks, sensors, etc.
- Transform: Data is formatted, structured, enriched, or otherwise modified for reporting and answering analytical questions.
- Load: Data is loaded into a final storage platform, such as a physical database or cloud storage.
All three steps are required, but the way in which we perform them can vary. The traditional definitions of ETL and ELT — the definitions we most commonly encounter — offer two basic approaches. These are generalizations, but they say a lot about the types of data workflows we expect to see:
- What is ETL? Extract transform load (ETL) integrations are typically built with specific analyses in mind. Since transformations are applied pre-storage, engineers need to know in advance what data they’ll need and what characteristics it must have. This now-orderly data is stored in a relational database either on-premise or in the cloud. When this is done on-prem, organizations can maximize data security.
- What is ELT? In extract load transform (ELT) integrations, data is stored immediately and transformed later. Cloud data warehouses or data lakes are usually the storage solution of choice. This allows huge volumes of structured, semi-structured, and unstructured data to be stored. Analysts can define their questions and requirements later, and apply transformations at that point.
This division implies that organizations must choose between stability and flexibility in their data architecture, which is hard because most organizations value both very highly.
So, how did we get here?
The history of the split paradigm
ETL arose alongside the relational database in the 1970s. Computing and storage capabilities were limited at that time, so ETL integrations were built around predictable, straightforward analyses. Tracking KPIs for a popular product is a common example. These early transforms were usually built in-house and ran on the organization’s servers during overnight downtime.
With the rise of distributed processing in the early 2000s, more powerful transformations became possible at a larger scale. Organizations began to adopt processing engines as they came onto the scene: first Hadoop, then Spark, Flink, and others. Such systems are common ETL components to this day, but they’re expensive to operate — not only the systems themselves, but also the specialized engineering teams needed to maintain them.
Advances in cloud storage and streaming made ELT possible. It offered speed and flexibility, but perhaps more importantly, it was cheaper.
Instead of specialized engineers working with a complex ETL tool, any analyst could query the data warehouse on their own time using SQL. Plus, cloud storage provided virtually unlimited scaling at a relatively low cost compared to traditional data warehouses. Between moving to the cloud, cutting out the external processing system, and reducing the number of engineering hires, data infrastructure suddenly became affordable.
The truth about ELT
ELT was a new paradigm for two major reasons:
- Location of major transformations: directly within the data warehouse instead of an external analytic system.
- Central priority: getting data into cloud storage as fast as possible rather than transforming it immediately.
This is what allows ELT systems to be cost-effective and streamlined. But it’s arguably not enough to set them apart from ETL systems. Here’s why:
Any high-quality ELT setup does have a transformation prior to loading.
If you purchase an ELT solution, it’ll come with a cloud data warehouse. While these are more flexible than relational databases, the loaded data must still have a schema. It doesn’t have to be complex, but it must be documented and queryable. Most likely, you’ll need to modify at least some incoming data to fit the schema.
This modification is a transformation: a small transformation, but one that fits the definition nonetheless.
These minor transformations can be done on the fly and don’t require a large processing engine or introduce latency, like most traditional ETL transformations. Thus we tend to gloss over them when we talk about ELT. Either way, when you look closely, there aren’t many good examples of true ELT — at least if you want your stored data to be usable.
Can I use a data lake for my ELT pipeline?
At first glance, part of the allure of ELT is the implication that you can pipe all your raw data indiscriminately into storage and extract what you need for analysis at any time. It can be tempting to build a true ELT pipeline using a data lake as your storage endpoint, which will accept any structured, semi-structured, and unstructured data.
This rarely works, and it can place an undue burden on data analysts.
Once you load your data into a data lake, locating and retrieving it becomes nearly impossible. It’s akin to an entire organization throwing its paper files in a giant pile in the basement. This is where the criticism of the data lake as a data swamp comes from. Data lakes may have their place, but they’re not a good central repository for your organization’s entire data collection.
Perhaps a better way to think about ELT is as a modernization of ETL. It reduces some of the pain points — latency and cost — but doesn’t completely reinvent the wheel.
A new paradigm: modernizing ETL
Once we start looking at ELT not as a totally separate method but as a subset of ETL, we can begin to view the landscape as one that is evolving rather than dividing. Since the rise of ELT, we’ve seen many other new solutions go further to solve common pain points.
See if you recognize any of these terms:
- ETLT. We’ve already established that properly managed ELT pipelines include light transforms before loading. If you beef these up a bit, you might call the result ETLT. Transforming data before and after storage can reconcile previously divided workflows and help strike a balance between data management and flexibility. For example, you can enrich data on the fly and perform complex analysis in the cloud.
- Modern ETL. Traditional ETL was characterized by compute-intensive, high-latency transformations — that’s part of what prompted the rise of ELT. But ETL advances have been far from stagnant. Today, ETL solutions can use streaming data and perform real-time analysis, while still supporting batch workflows. More source and destination systems are supported.
- Reverse ETL. In reverse ETL, like ELT, data is loaded quickly into the data store. Unlike ELT, the data is then pushed back out of the data store into a variety of third-party systems. This allows operationalized data usage while retaining a single source of truth.
These are just three examples with recognizable names — the actual landscape is a lot more complex. The most important takeaway is that you can leverage the existing components of your data stack and create a system that meets your data goals.
The best modern data integrations are hard to label, and definitely don’t fall under the definitions of ETL or ELT. At this point, these acronyms are more useful as a thought exercise than they are for describing systems that actually exist.
ETL, ELT, and Estuary Flow
There’s a growing number of products, services, and open-source solutions available to modernize data infrastructure.
This is a good sign: it represents the growing variety of options at your disposal. But as the old ETL paradigm continues to subdivide, it’s important to make sure you don’t end up siloing yourself into a narrow system.
Our platform, Flow, is neither ETL nor ELT. It allows you to build real-time data pipelines around your existing systems. You can build transformations directly into the pipeline, or use an external analytics engine or SaaS product. The result? A dependable, extensible system that’s as unique as your use case.
Want to talk about what this would look like? Get in touch. Or sign up for a free trial of the new Flow web application.
Wondering how it works? We develop in the open.