Microsoft SQL Server is a massively popular relational database. It often acts as the main transactional database in your stack and is probably your most accurate, up-to-date source of truth.
But it’s critical that the current state of your data is reflected across your other data tools — and your entire organization.
This is why SQL Server change data capture (CDC) is critical for any business that:
- Runs important applications on SQL Server.
- Cares about accurate, current data for analytics and BI.
In this guide, we’ll walk through the basics of CDC, and how implementing it on SQL Server is a bit different from some other databases. We’ll go over CDC’s advantages, performance impact, and some alternatives.
We’ll also walk step by step through how to enable SQL Server CDC, and how to turn this basic capability into a full-fledged data pipeline.
Interested in CDC in another RDBMS? Check out the guides for Postgres and MySQL.
What is Microsoft SQL Server?
If you’re reading this guide, you probably know a fair bit about SQL Server, so we’ll keep this part brief. But it’s important to discuss SQL Server’s unique characteristics — they’re the reason it’s a candidate for CDC, and they determine which CDC methods are possible.
SQL Server is Microsoft’s flagship relational database (RBDMS). For the uninitiated, that means it stores data in a systematic row-and-column format that’s easy to query quickly.
You’d typically use SQL Server as the main transactional database for a product or application. Its job is to keep track of the current state of affairs, from user interaction to finances to inventory.
The fact that SQL Server is a commercial solution immediately sets it apart from other popular RBDMS, like MySQL and PostgreSQL, which are open-source.
Despite coming with a price tag, SQL Server remains one of the most popular RBDMS available today. It has features and advantages that larger companies with a budget are happy to pay for.
The advantages of SQL Server include:
- Policy-based user management, making security compliance easy.
- Excellent performance under constant high-traffic workloads.
- Some of the most advanced encryption algorithms available.
- As a commercial solution, SQL Server is regularly updated and comes with user support.
- It’s part of a large ecosystem of other products for related tasks like reporting and analytics, as well as the Windows OS, which makes collaboration easier for larger teams.
Moving forward, the two things we should keep in mind about SQL Server are:
- SQL Server is used as a transactional database, so getting data out using CDC is important.
- SQL Server is a commercial solution, and Microsoft provides more built-in functionality for things you might otherwise build on your own.
Why Change Data Capture for SQL Server?
SQL Server is designed to handle transactional workloads. These aren’t the same as financial transactions, but they follow that general pattern: changes to a system influence its overall state.
Database transactions can represent all sorts of real-world actions. Here are a few simple examples:
- Item added to a shopping cart or purchased → Inventory changed.
- Payments → Change to total funds saved, earned, or spent in a category.
- New user enters or user makes an action → Profile of the user is added or has changed.
This means that SQL Server has the most accurate data on things like inventory, finances, or users. And that data is critical for other aspects of the business.
To visualize, manipulate, analyze, or act holistically on insights from your data, you’ll need to get that data out of SQL Server. While RBDMS are great for transactional workloads, they’re poorly suited for analysis.
Because the data in SQL Server changes constantly, data extraction isn’t a one-off event. It’s a continuous process.
The process you choose to move data out of SQL Server should be:
- Reliable (not vulnerable to outages and downtime)
- Timely (your business expects fresh data in minutes or seconds).
- Efficient (doesn’t break the bank at scale).
That’s where CDC comes in.
Change data capture, or CDC, is the process of recognizing a change in a source data system so that a downstream system can act on that change, usually by updating a target system to reflect new information.
Transaction Log-Based SQL Server CDC
Technically, there are many ways to perform CDC. But for most of this guide, I’ll be talking about log-based CDC for SQL Server, which the platform provides natively. Log-based CDC is the only ultra-low-latency method that allows us to approach real-time insights.
(Toward the end of the article, we’ll also circle back to other methods that are still technically “CDC,” but don’t offer us that same timeliness.)
Real-time CDC methods for any RDBMS rely on some sort of built-in logging capability. In SQL Server, that’s the transaction log. The transaction log records every transaction in the database as well as its impact on table rows.
The transaction log is designed as a backup in case of database failure. But it also tracks the exact information we need to know for CDC.
In other, open-source RDBMS, we’d connect directly to the log to enable CDC. For example, MySQL CDC connects directly to the MySQL binary log. But this is where SQL Server, a commercial product, is a little different.
SQL Server doesn’t allow us to connect directly to the transaction log. Instead, it provides a pre-built CDC functionality that does so for us.
To use it, you enable CDC on the entire database, as well as individual tables you’re interested in. SQL Server will parse its transaction log and create change tables for each of the tables you’ve specified. From here, an external application can read the change data.
Pros and Cons of Change Data Capture for SQL server
SQL Server’s native CDC functionality can make life easier, but it also has limitations.
Advantages: SQL Server CDC
- User-friendly and easy to set up.
- Change data comes in table form organized by source table; no need to parse logs.
- All change tables are in the same location in the same schema, so you only need to create one path to the CDC data for your outgoing pipeline.
- Data pipeline platforms that support CDC typically are built for this method, so it’s easy to move change data to another system.
Disadvantages: SQL Server CDC
- The extra step (creating change tables) adds small amounts of latency to the process, though this is not noticeable at a large scale.
- Maintaining change tables in the database has a performance impact.
Let’s dig a bit more into the performance impact of CDC in SQL Server.
Every time you enable CDC on a table, you’re creating a new table. By default, change data sticks around in these databases for three days before SQL Server deletes it automatically.
This inevitably has a performance impact, which can vary widely depending on:
- How many change tables there are.
- How many changes are being written to those tables.
If both of these factors are significant, you can see your database performance decrease. Be especially mindful of multiple databases using pooled resources! To amend this, you must provision more compute resources. You can also lower the CDC retention period.
Regardless, CDC will almost always have a smaller performance impact than, say, repeatedly querying the entire source table for changes.
SQL Server CDC vs Change Tracking
SQL Server CDC is not to be confused with change tracking. Change tracking is a related functionality built into SQL server, but has a different purpose.
While change data capture records all changes to a table, change tracking only compares the difference between the current state and a given time in the past. Intermediate changes aren’t retained.
In general, if you want to enable real-time data pipelines out of SQL Server, you should use its CDC capability. If you’re using queries to track changes to the database periodically, use change tracking.
Enable SQL Server Native CDC: Step by Step
To enable CDC for tables in your SQL Server database, you’ll need to enable a database-wide setting as well as a setting on the individual tables.
- In Microsoft SQL Server Management Studio or your client of choice, connect to the database.
- Enable CDC on the database.
Tip: If you’re using a hosted cloud provider’s version of SQL server, this command might look different. For example, in Amazon RDS, use
msdb.dbo.rds_cdc_enable_db, and in Google Cloud SQL, use
- Enable CDC on a table in the database (repeat if there are multiple tables). This assumes the table is in the default schema, dbo.
@capture_instanceis optional, but allows you to name the CDC table, if you’d like.
`EXEC sys.sp_cdc_enable_table @source_schema = 'dbo', @source_name = 'my_table', @role_name = 'my_role', @capture_instance = 'dbo_my_table'`;
Setting up a Pipeline for SQL Server CDC
Enabling CDC in SQL Server is only half the battle. You still need to get that data out of SQL Server and to the destination system where you need it; for example, a data warehouse.
This is exactly what a data pipeline does. But not all pipelines are equally suited to this purpose.
CDC is event-based by nature and produces updates continuously. To reap the benefits downstream, you’ll need a data pipeline that also has an event-based, streaming (or real-time) architecture.
A data pipeline that uses a batch workflow — that introduces a delay — will nullify all the benefits of CDC.
You’ll also need a pipeline that integrates specifically with SQL Server for CDC. You can engineer this yourself, or use a managed data pipeline platform.
Debezium: Open-Source CDC Framework
CDC pipelines are hard to build. This isn’t the kind of project you’ll want to start from scratch (unless you have a lot of time on your hands). Fortunately, open-source frameworks exist that make the workload reasonable.
The most popular of these is Debezium. Debezium captures events from databases like SQL Server, translates them to JSON, and uses Apache Kafka as its real-time event bus.
If you’re comfortable working in the Kafka ecosystem, this is an excellent choice for you. If you’re not, landing the real-time data that’s now flying around in Kafka topics in a familiar system is going to present a challenge.
In that case, find a CDC platform that includes easier UX and doesn’t force you to interact with a streaming broker directly.
We designed Estuary Flow with this in mind.
Estuary Flow: Easy UX for CDC Pipelines
Flow is a managed platform for building streaming data pipelines. It includes a CDC connector for SQL Server.
Though Flow is built on a streaming broker, you don’t have to manage finicky streaming infrastructure directly. Instead, you manage your data and data pipelines using either the web UI or CLI, whichever you prefer.
Once you’ve captured your tables from SQL Server using CDC, Flow backs them up as low-latency data lakes in the cloud. You can then send them to destination systems like Snowflake, BigQuery, and many more — in real time.
To set up MySQL CDC in the Estuary Flow web UI:
- Enable CDC on the database as described above.
- Run a few more SQL queries to create a watermarks table and a user role to use with the connector.
- Whitelist Estuary’s IP or set up an SSH server.
- Sign up for the free tier of Flow.
- Create a new Capture, filling in the server address, database name, user, and password.
- Publish the capture.
- To complete the pipeline, push the data to a destination system with a materialization.
You can find full setup instructions in the docs.
SQL Server change data capture alternatives
CDC isn’t the only way to get data out of SQL Server. If you’re not operating at a very large scale or you don’t care about latency, one of these methods could be a better fit.
Queries and SQL Server Change Tracking
You can keep track of changes in SQL Server by querying the database at a set time interval. This is a batch method that introduces latency.
SQL Server offers another native capability called change tracking to make this easy.
When change tracking is enabled, SQL Server records how table rows have changed since a checkpoint: usually, the last time you ran a query. Every time you query the change data, change tracking restarts. Unlike with CDC, intermediate changes aren’t recorded: just the difference between the last checkpoint and the current state.
SQL Server also provides special functions to extract change tracking data. From there, you’ll have to figure out how to get it into a destination system (for example, your data warehouse).
Advantages: SQL Server Change Tracking
- Performance is more efficient than manually querying tables.
- You don’t need a timestamp or version number column in the source table like you would if you queried them directly.
Disadvantages: SQL Server Change Tracking
- Not a real-time solution: the query interval introduces latency.
- It’s up to you to get the data to its final destination.
Triggers for SQL Server CDC
Alternatively, you can use a DML trigger function in SQL Server to listen for INSERT, UPDATE, and DELETE events on a table of interest.
Trigger functions are instantaneous, but can have hefty performance implications.
Advantages: SQL Server Triggers
- Detects database changes in real time.
- Relatively easy to set up.
Disadvantages: SQL Server Triggers
- Trigger fires for any qualifying event, whether the event affects 0 rows or thousands. You must be prepared for this downstream.
- It’s up to you to get the data to its final destination.
- Triggers have a negative performance impact on their tables, and having too many triggers in your database isn’t advised.
Key SQL Server Features and Terms for CDC
Real-time SQL Server CDC involves some of SQL Server’s distinct features and capabilities. Here’s a quick reference.
Change data capture (CDC)
CDC has two meanings here:
- The general process of recognizing a change in a source data system so that a downstream system can act on that change, usually by updating a target system to reflect new information.
- The SQL Server capability that makes note of changes to specified tables, specifically by noting DML activity on tables of interest and making that data available in relational form.
Change table (capture instance)
When CDC is enabled in SQL Server, each table that’s being tracked has a corresponding change table, also known as a capture instance. Each change event is written as a row.
Change tracking is a built-in SQL Server capability that provides an alternative to CDC. It keeps track of the differences in tables between the current state and a historical checkpoint, and is useful for query-based data pipelines.
The SQL Server transaction log records all transactions in a database and the database changes that occurred as a result. It’s designed as a recovery mechanism for system failures, but also provides the foundation for CDC.
Write-ahead logging (WAL)
The SQL Server transaction log uses a mechanism called write-ahead logging to ensure the durability of all database transactions. This means that database transactions are written to the transaction log before they’re written to disk.
In other databases (namely, Postgres) the log itself is called the “write-ahead log.” In SQL Server, it’s called the “transaction log,” but both serve the same function and provide the same guarantees, and both are used for CDC in their respective systems.
Unlike other RBDMS, Microsoft SQL Server offers a native capability for CDC. But to fully enable SQL Server CDC, you need to build a complete data pipeline that gets change data out of the database and into another system. If you’ve concluded this is the right data extraction method for you, you can choose to build that pipeline based on an open-source framework or use a managed platform to save time.
To try Estuary Flow for SQL Server CDC, use these resources:
- 🆓 Sign up for free
- 👋 The Estuary Slack community
- 🧑💻 Source code: Flow and the SQL Server CDC connector
- 📚 Docs: Quickstart guide and the SQL Server CDC connector