Change data capture has become a familiar phrase in the world of data, but it’s not an easy, predictable, or one-size-fits-all solution. Rather, it’s a broad, rapidly evolving category full of complexity — and lots of potential.
In this article, we’ll cover the essential foundations of change data capture. We’ll break down exactly what data capture is, common implementation patterns, and when you should use it.
What is change data capture?
Change data capture, or CDC, is the process of recognizing a change in a source data system so that a downstream system can act on that change, usually by updating a target system to reflect new information. It’s a method of data integration: ensuring that data is up-to-date across different components of an organization’s data stack.
Let’s flesh that out with a common scenario. Say you’re a company that manages customer records in a relational database, like PostgreSQL, and powers data analysis from a data warehouse, like Snowflake. When a customer record is updated in Postgres, your CDC mechanism takes note and updates the corresponding record(s) in Snowflake.
That’s a simplified use case, but the pattern is a very common one: migrating data updates from an RDBMS to a data warehouse. (Need a refresher on different types of data storage systems? We’ve got you.)
The basic concept of CDC can be applied to other types of data systems, too, using various methods. Though CDC is often marketed as a real-time data solution, that’s not always the case. Depending on the method you use, changes might be reflected in the target in real time or at a regular interval — that is to say, using batch processing.
Change data capture methods
All change data capture architectures have a source: the data system where the change is first stored. They also have a target that needs to reflect updates, usually a different type of system. Dig in any deeper than that, and you’ll find that setups vary widely.
To simplify matters, we’ll group common change data capture methods into three categories: query-based change data capture, change data capture using triggers, and finally, the most sophisticated method: real-time change data capture with database logs.
1: Query-based change data capture
The first change data capture method involves simple SQL queries on tables in the source database. Since CDC sources tend to be relational databases, administrators will be quite comfortable using SQL to manipulate tables. This makes the query method approachable for most teams.
To set this up, source tables must have an extra column to track when each record was last changed. This is usually a timestamp column, but incremented version numbers can be used, as well, with a bit of extra legwork.
|0001||Joe||Shmoe||123 Main St||2022-02-03 15:32:11|
To find changes to the table, you’d run a query against it to select records where the timestamp value is greater than the time at which you last ran the query. For example, if you last queried the customers table at 7 AM on February 1:
SELECT * FROM customers WHERE time_updated > ‘2022-02-01 07:00:00’;
Typically, you’d configure a recurring query at a standard time interval. The selected records would be used to update the target at that same cadence.
Shortcomings of this approach
- Can’t capture deletes: When you completely remove a record from the table — this is known as a hard delete — the query won’t detect it. You can work around this by using soft deletes. This means you retain all records, but add a boolean `deleted` field. However, this isn’t usually ideal because it can bloat tables with useless records.
- Latency: the query method is a form of batch CDC. There will always be some lag between data updates in the source and target. The length depends on your query frequency, which you might not be able to get as high as you’d like for performance reasons.
- Performance: Every time you run the query, you’re polling the entire table. With a large table, you’ll run into performance issues. This tends to be particularly problematic when your table starts fairly small and grow significantly — a common occurrence. You won’t be able to query it as often as you like, and each time you run the query, your entire system can get bogged down, sometimes for hours.
2: Change data capture with database triggers
Trigger functions allow developers to create procedures that are carried out in response to a certain event within the database table. For CDC specifically, you’d write a trigger function to listen for insert, update, and delete events, and build a changelog.
Triggers can be implemented on the SQL level, and their output is immediate, making them an alluring alternative to querying while still staying in familiar territory for most teams. Triggers setup will be different depending on the DBMS you use.
Shortcomings of this approach
- Performance: Triggers impact the performance of the source database, especially at scale. This is because of the way the procedure acts on the database each time a changes are made. As with queries, this performance issue may be insignificant at first, but can become debilitating as your data grows.
- Getting the data out of the source is complex: Triggers may be a good choice if you simply need a record of changes within the original database: a valid but uncommon pattern. However, if your target is a separate system (which is usually the case) things get more complex. Often, the solution will involve polling the changelog, which introduces latency.
3: Real-time change data capture with database logs
As you can tell, the previous two methods have significant limitations related to scale and latency. This rules them out for most organizations.
The reason for the buzz around change data capture is recent years has more to do with the final category of implementation: real-time change data capture using database logs.
In addition to the source and target, this method requires some sort of intermediate system — a CDC platform which typically uses a message queue to relay changes to the target.
The intermediate system listens to the event log included in recent versions of most DBMS. It goes by different names; for example, the write-ahead log (WAL) in Postgres and the binary log (binlog) in MySQL. Regardless, with some configuration, these logs are a reliable means to track change events.
When the intermediate system detects a change in the source, it sends the event via the message queue to the target, where it is consumed nearly instantaneously.
This CDC method isn’t something that can be set up with SQL alone; it has a much more complex architecture. But it has become attainable thanks to new platforms, most notably Debezium. Debezium is an open-source CDC platform that can be used as-is or modified to create custom implementations. It typically relies on Apache Kafka as a message queue.
Debezium-and-Kafka-based architectures are powerful, but they aren’t the last word in CDC. Like any technology system, they have shortcomings, and you can expect to see even more innovation in this space in upcoming years.
When to use real-time change data capture
Though the concept of CDC has been around for a while, technology has only recently caught up with the idea. Source databases themselves have improved, open-source CDC platforms and managed services are proliferating, and real-time event streaming has become common in the industry.
Taken together, these advances allow the full breadth of CDC implementations described above. The third category — real-time change data capture with database logs — is particularly noteworthy and supports the most cutting-edge implementations.
This is fortunate timing, as more and more organizations are turning to real-time CDC out of necessity. As data stacks and data requirements grow, older data integration methods are proving insufficient. At scale, the performance impacts of SQL-based CDC and older ETL setups are becoming showstoppers, not just inconveniences.
You might benefit from real-time change data capture if:
You need a reliable, timely data pipeline that scales.
For the past decade or so, many organizations’ data stacks have been held together by custom ETL pipelines. These are batch pipelines that introduce data latency and are notoriously time-consuming to build and manage.
Many data engineering teams have been assembled around this status quo, but now, the tide is shifting. As databases scale, each ETL job brings a larger processing burden and introduces even more latency. At the same time, business requirements demand data quicker than ever before. This means you need real-time, highly scalable data integrations — and batch ETL doesn’t fit the bill.
You want to minimize operational burdens on the database.
When your database is the only authoritative data source, everyone needs to use it directly. This means that production workflows get bogged down by other types of work, especially analytics.
With CDC acting as a middleman, you can separate these two important categories of data usage. The data warehouse, a system better suited for analysis, will always contain a current, reliable copy of the data. Now, only production workflows will use the database directly, thus improving performance.
Your organization is shifting toward a distributed data mesh.
Perhaps your organization is shifting from a data monolith to a more distributed model: a data mesh, or something akin to it. The benefits of a data mesh from an organizational standpoint are numerous. However, for it to be successful, the broken down components, or microservices, absolutely require timely data flow. CDC can help.
You have the resources to implement change data capture on your stack.
Like any data infrastructure change, whether or not you implement CDC is a business decision. Different methods require different amounts of engineering work, funds, or both. As you plan, this is a critical balancing act.
The SQL-based change data capture methods discussed above are generally easy to implement. However, given their limitations, they’re not worthwhile for most organizations looking to advance their data capabilities.
But real-time change data capture using logs and a message queue has a higher barrier to entry and comes with its own challenges. It can be difficult to evaluate whether this approach is feasible and worth the initial effort for your team.
In the next post of this series, we’ll explore real-time change data capture in more depth: where it shines, where current implementations fall short, and why you should expect more from the newest technology.
Estuary is building a real-time DataOps platform — for all your CDC needs and much more.
» Or, check out the code on GitHub — we’re open source.
» To work with us before our launch, get in touch here.