Who is this guide for?
Welcome to Estuary’s guide to change data capture in PostgreSQL!
This resource is designed to support a wide range of data-focused and data-adjacent professionals who meet two simple requirements:
- You’re part of an organization that uses the relational database PostgreSQL, or intends to migrate to it.
- You’re aware of the industry buzz around change data capture (CDC) and want to learn more.
Business leaders: you need a no-nonsense foundation to make decisions. We cover the essentials of Postgres and CDC, plus the pros and cons of different approaches.
Engineers and database managers: you intimately know the systems you work with, but figuring out the best changes to make can be time-consuming. Here you’ll find a succinct intro to a range of Postgres CDC methodologies.
Implementing a new data architecture or strategy requires understanding and buy-in from a variety of teams that may not work together day-to-day. Having a common starting point can make the collaborative process easier. That’s what this guide is for.
Who is Estuary?
Estuary is the company behind Flow, a next-generation DataOps platform that offers real-time, scalable, flexible data integration capabilities. That includes real-time CDC for Postgres, plus much more. Our team has decades of combined experience in data engineering. And we understand the roadblocks data-driven companies face because we’ve lived them.
While we’re excited to discuss Flow’s CDC capabilities, that’s not the point of this guide. Instead, we aim to provide a holistic overview of Postgres CDC and give you the tools to make the right choice for your unique use case.
(PS – we have a similar guide to MySQL you can find here.)
Let’s get started.
- PostgreSQL: an Introduction
- Why Change Data Capture for PostgreSQL?
- PostgreSQL Change Data Capture Methods
- Estuary Flow for Real-Time PostgreSQL CDC
- Additional Resources
PostgreSQL: an Introduction
PostgreSQL, also known as Postgres, is a popular open-source object-relational database. It’s completely free, but it’s also an extremely stable enterprise-class system backed by over 25 years of open-source development.
Other popular relational databases include Oracle, MySQL, SQLServer, and DB2. So why use Postgres specifically? Here are a few reasons.
- As mentioned above, it’s free — no matter what you choose to do with it.
- Postgres is known for being extensible and adaptable. It’s open-source, allowing developers to easily integrate it into new projects. It also uses standard SQL, which means it’s less likely to have conflicts with other systems. Plus, it supports user-defined data types.
- It’s highly fault-tolerant, secure, and ACID compliant, meaning transactions are guaranteed to be valid in the face of errors and downtime.
- It goes beyond standard SQL capabilities. Queries in both SQL and JSON are supported, as well as a wide variety of data types, even within the same table.
- Postgres is well-suited for CDC workflows.
One of Postgres’s most popular implementations is as the primary transactional database for an enterprise. That is, it acts as the quick, robust system of record that powers products and applications, and tracks user information.
For example, consider an online store. Postgres would be used to store customer data, track inventory, and process sales. It’s well-suited to these workflows because it’s a row-based relational database.
Though Postgres excels at certain jobs, it’s highly unlikely for it to stand alone. Relational databases like Postgres are just one component of a modern data stack. They’re simply not optimal for all workflows.
The most obvious example of this is analytics.
Let’s return to our online store example. To succeed in e-commerce, you need data analysis to evaluate performance, market correctly, plan inventory, and more. While Postgres would handle the moment-to-moment operations, it’s poorly suited for these analytical workflows. That’s where a data warehouse comes in. Data warehouses are a complimentary type of data storage optimized for analytics, but poorly suited for transactional workflows.
To learn more about different types of data storage, see this post.
Why Change Data Capture for PostgreSQL?
In most implementations like the one described above, your Postgres database will be your most up-to-date data storage system. But you’ll almost certainly have other storage systems, like data warehouses, which in turn power analytical applications.
This presents a classic data integration problem. You need to get the latest data from Postgres to all your other systems in a way that’s:
- Timely, in a world that increasingly demands up-to-the-minute information
- Reliable and able to withstand errors and downtime
- Minimally harmful to performance, even as your data grows to terabyte scale
When implemented correctly, change data capture can meet all these requirements.
Change data capture, or CDC, is the process of recognizing a change in a source data system so that a downstream system can act on that change, usually by updating a target system to reflect new information.
Postgres is one of the most commonly requested sources for CDC. Fortunately, it’s also well suited to it.
- For database administrators, simpler CDC methods are easy to set up using SQL alone.
- Advanced, real-time CDC methods are also attainable thanks to Postgres’ write-ahead log (WAL).
In the next section, we’ll cover these methods in more depth.
PostgreSQL Change Data Capture Methods
Most of the current change data capture buzz is around real-time CDC: in which changes from the source, like Postgres, are reflected in the targets within seconds or milliseconds. This is for good reason, as we’ll soon see. However, CDC has no temporal requirement by definition.
This means that CDC methods can be batch or real-time. In general, batch methods tend to be easier to implement, but introduce latency and in many cases have significant performance impact. Real-time CDC is event-driven, scales better, and has been challenging — at least historically.
Queries for Postgres CDC
SQL queries are a simple method for batch CDC in Postgres that can be set up using your familiar Postgres client.
For this to be possible, the source table’s schema needs to have a column with a timestamp or version number indicating when it was last updated. If it doesn’t, you’re better off using the trigger method, below.
Your SQL query will look something like:
SELECT * FROM my_table WHERE time_updated > time_last_query’;
Because Postgres doesn’t have a built-in task schedule, you’ll have to use an external cron tool or other scheduler utility to run the query.
At this point, these changes still need to make it to the target system, which presents another challenge. You’ll need to create a bespoke ETL pipeline or involve a challenging event-based framework like Kafka Connect.
Triggers for Postgres CDC
You can write these events to a second table, or use them as an alternative method to update timestamps, as discussed above.
From there, you’ll face the same challenge as when using queries: either building or purchasing a pipeline solution to write the changes to the target system.
Postgres Write-Ahead Log for CDC
Both queries and triggers are methods of tracking change events in Postgres, but Postgres already has this functionality built in. The write-ahead log (WAL) makes note of all changes that have occurred in the database and acts as the authoritative source of the data state.
The WAL was designed mostly for recovery purposes and as insurance for data integrity. Its characteristics make it ideal for event-based CDC — what many have come to think of as true CDC.
First, the WAL must be logically decoded into a format that other applications can read, like tuples or SQL. Then, a message queue or event bus can relay these decoded event logs to the target.
The details of actually sending those decoded change events via the message are complex, and fall outside the purview of standard database management.
Fortunately, platforms and services are becoming available to make this method more approachable.
The open-source platform Debezium paved the way to making event-based CDC approachable. Debezium uses Apache Kafka as an event bus, and offers out-of-the-box connectors to popular sources, including Postgres. Many other Postgres CDC services are becoming available, and leverage Debezium as their open-source foundation.
Estuary Flow for Real-Time PostgreSQL CDC
Estuary’s DataOps platform, Flow, includes a connector for Postgres CDC.
So, what exactly does that mean?
Flow is a centralized platform for building real-time data pipelines that connect a wide variety of source and destination data systems. It’s meant to be extremely flexible, adapting to your workflows so that all of your data stays in sync. To enable this, the central, event-driven runtime plugs into our many open-source connectors, which integrate with the source or destination system.
When you use our Postgres source connector, Flow acts as a log-driven, real-time CDC pipeline.
This connector leverages the WAL, as discussed above, to capture change events to your Flow data collection. Cloud-storage-backed collection data is streamed through the Flow runtime, which is built on the Gazette, an open-source streaming broker that was also built by Estuary’s engineering team.
Your data collections captured from Postgres are automatically backed by a data lake in cloud storage. From there, they can be streamed to one or more target systems using any available materialization — or destination — connector. This occurs with millisecond latency.
Features of Flow as a Postgres CDC pipeline:
- High degree of customization using the intuitive GitOps workflow or UI: Add additional data sources or multiple destinations to your CDC pipeline, and combine or separate streams of data without direct engineering requirements on the streaming broker.
- Ability to add transformations without an additional tool: aggregations, joins, and stateful stream transformations are all possible.
- Minimize performance impact of historical backfills. Backfilling data from Postgres tables is important for data integrity, even in event-driven CDC, but re-reading large tables can cause performance impacts or even errors like data duplication. Flow avoids having to re-read the whole Postgres table by writing a checkpoint of where it left off in that source table.
- Quick Postgres CDC demo in Flow
- Blog series on CDC
- Flow documentation
- Flow and the Postgres CDC connector code on GitHub
- Get started with Flow today