Who is this guide for?

Welcome to Estuary’s guide to change data capture in MySQL!

This resource is designed to support a wide range of data-focused and data-adjacent professionals who meet two simple requirements:

  • You’re part of an organization that uses the relational database MySQL, or intends to migrate to it.
  • You’re aware of the industry buzz around change data capture (CDC) and want to learn more.

Business leaders: you need a no-nonsense foundation to make decisions. We cover the essentials of MySQL and CDC, plus the pros and cons of different approaches.

Engineers and database managers: you intimately know the systems you work with, but figuring out the best changes to make can be time-consuming. Here you’ll find a succinct intro to a range of MySQL CDC methodologies.

Implementing a new data architecture or strategy requires understanding and buy-in from a variety of teams that may not work together day-to-day. Having a common starting point can make the collaborative process easier. That’s what this guide is for.

Who is Estuary?

Estuary is the company behind Flow, a next-generation DataOps platform that offers real-time, scalable, flexible data integration capabilities. That includes real-time CDC for MySQL, plus much more. Our team has decades of combined experience in data engineering. And we understand the roadblocks data-driven companies face because we’ve lived them.

While we’re excited to discuss Flow’s CDC capabilities, that’s not the point of this guide. Instead, we aim to provide a holistic overview of MySQL CDC and give you the tools to make the right choice for your unique use case.

(PS – we have a similar guide to PostgreSQL you can find here.)

Let’s get started.

Contents

  1. MySQL: an Introduction
  2. Why Change Data Capture for MySQL?
  3. MySQL Change Data Capture Methods
  4. Estuary Flow for Real-Time MySQL CDC
  5. Additional Resources

MySQL: an Introduction

MySQL is a popular open-source relational database. Though it’s free to use, MySQL has been owned and sponsored by a corporation since its inception: first by the Swedish MySQL AB, and later by Sun Microsystems, which is now the Oracle Corporation.

There are a handful of other popular relational databases, including PostgreSQL, Oracle SQLServer, and DB2. Here are a few reasons you might choose MySQL specifically.

  • MySQL was developed for fast performance out of the box, but offers customization for advanced users with specialized use-cases, including 16 possible storage engines.
  • It’s known as an extremely secure database, featuring a unique security layer and encryption capabilities. It’s fault-tolerant and ACID compliant, making it an all-around reliable option.
  • It’s free and open-source, providing flexibility, but official support channels are available.
  • Compared to other RBDMS, it’s simpler to set up and start working with.
  • MySQL is well-suited for CDC workflows.

MySQL is commonly used as the primary transactional database for a data stack. That is, it acts as the quick, robust system of record that powers applications and keeps track of rapidly changing information.

Consider one very popular application for MySQL: e-commerce. An online store needs a transactional database to track data about customers and inventory, and to process sales. MySQL is an ideal choice because it’s a fast and secure row-based relational database.

Just as an online store is unlikely to be the only application a retailer builds, MySQL is unlikely to stand alone as a data system. Relational databases simply aren’t ideal for all workflows — most notably, data analytics.

While MySQL is great at managing the transactions that make sales and record-keeping possible, it’d be a poor fit for analysis. And for our hypothetical retailer to succeed, they need to analyze their store’s performance, marketing, inventory, and more. Data warehouses are designed to power analysis like this. However, they are ill-suited for transactional workflows, making them complementary to systems like MySQL.

To learn more about different types of data storage, see our blog post.

Why Change Data Capture for MySQL?

In common use patterns like the one described above, MySQL will be the most up-to-date data storage system in the stack. But your data warehouses and other storage systems, which in turn power analytical applications, also need to be accurate.

This is a classic data integration challenge. You need to get the latest data from MySQL to all your other systems in a way that’s:

  • Timely, in a world that expects fresh data within minutes or seconds
  • Reliable in the face of inevitable errors and downtime
  • Minimally harmful to performance, even as your data grows to terabyte scale

When implemented correctly, change data capture can meet all these requirements.

Change data capture, or CDC, is the process of recognizing a change in a source data system so that a downstream system can act on that change, usually by updating a target system to reflect new information.

MySQL is one of the most commonly requested sources for CDC. Fortunately, it’s also well suited to it.

  • For database administrators, simpler CDC methods are easy to set up using SQL alone.
  • Advanced, real-time CDC methods are also attainable using MySQL’s binary log, or binlog.

In the next section, we’ll cover these methods in more depth.

MySQL Change Data Capture Methods

When we talk about change data capture these days, we’re generally referring to real-time CDC. However, CDC has no temporal requirement by definition. Various methods exist, and they can be batch or real-time. We’ll cover both here.

In general, batch methods tend to be easier to implement, but introduce latency and can have a significant performance impact. Real-time CDC is event-driven, meaning changes from the source, like MySQL, are reflected in the targets almost immediately. Event-driven CDC scales better, but has historically been more challenging.

Queries for MySQL CDC

SQL queries are a simple method for batch CDC in MySQL. They can be set up on the command line or in a client application like MySQL Workbench.

For this method to work, the source table’s schema needs to have a column with a timestamp or version number indicating when it was last updated. If it doesn’t, you’re better off using the trigger method, which we’ll cover below.

Your SQL query will look something like:

SELECT * FROM my_table WHERE time_updated > time_last_query’;

You can run the query at a recurring interval using MySQL’s built-in Event Scheduler.

Now, the selected records still need to make it to the target system, which is where the real complexity of this method lies. Most solutions involve either a bespoke ETL pipeline or a use a challenging event-driven framework like Kafka Connect.

Triggers for MySQL CDC

MySQL allows you to create triggers, functions that listen for insert, update, and delete events on a given table.

You can write these events to a second table, or use them to update timestamps, as mentioned in the section on queries above.

From there, you’ll face the same challenge as when using queries: either building or purchasing a pipeline solution to write the changes to the target system.

MySQL Binary Log for CDC

Both queries and triggers are methods of tracking change events in MySQL, but there’s another built-in MySQL functionality that you can take advantage of. When enabled, the binary log (binlog) tracks change events in the database and stores them as a binary file.

The binlog was designed as a source for database replication, as well as some data recovery implementations. This make it ideal for driving event-based CDC — what many have come to think of as true CDC.

The binlog is enabled by default, but can be turned off. To begin, you’ll need to make sure that yours is enabled and configured to store an appropriate amount of metadata. From there, your CDC platform will parse the binlog to isolate change events, and pass them to the target via a message queue or event bus.

The technical details of this sort of implementation can be complicated, and fall outside the purview of standard database management.

Fortunately, platforms and services are becoming available to make this method more accessible.

The open-source platform Debezium paved the way to making event-based CDC approachable. Debezium uses Apache Kafka as an event bus, and offers out-of-the-box connectors to popular sources, including MySQL. Many other MySQL CDC services are becoming available, and leverage Debezium as their open-source foundation.

Estuary Flow for Real-Time MySQL CDC

Estuary’s DataOps platform, Flow, includes a connector for MySQL CDC.

So, what exactly does that mean?

Flow is a centralized platform for building real-time data pipelines that connect a wide variety of source and destination data systems. It’s meant to be extremely flexible, adapting to your workflows so that all of your data stays in sync. To enable this, the central, event-driven runtime plugs into our many open-source connectors, which integrate with the source or destination system.

When you use our MySQL source connector, Flow acts as a log-driven, real-time CDC pipeline.

This connector leverages the binlog, as discussed above, to capture change events to your Flow data collection. Cloud-storage-backed collection data is streamed through the Flow runtime, which is built on the Gazette, an open-source streaming broker that was also built by Estuary’s engineering team.

Your data collections captured from MySQL are automatically backed up to a data lake in cloud storage. From there, they can be streamed to one or more target systems using any available materialization — or destination — connector. This happens with millisecond latency.

Features of Flow as a MySQL CDC pipeline:

  • High degree of customization using the intuitive GitOps workflow or UI: Add additional data sources or multiple destinations to your CDC pipeline, and combine or separate streams of data without direct engineering requirements on the streaming broker.
  • Ability to add transformations without an additional tool: aggregations, joins, and stateful stream transformations are all possible.
  • Minimize performance impact of historical backfills. Backfilling data from MySQL tables is important for data integrity, even in event-driven CDC, but re-reading large tables can cause performance impacts or even errors like data duplication. Flow avoids having to re-read the whole MySQL table by writing a checkpoint of where it left off in that source table.

Additional Resources

Estuary resources:

Other resources:

 

Leave a Comment

Your email address will not be published. Required fields are marked *