Apache Kafka is one of the most widely used open-source distributed event-streaming platforms. It is an excellent tool for high-performance streaming analytics, data pipelines, data integration, and mission-critical applications. But despite its benefits, it’s not the right fit for every job, which is why many businesses find great utility in connecting Kafka to databases like PostgreSQL.
PostgreSQL is an advanced open-source object-relational database system. Many businesses use PostgreSQL as the primary data store for mobile, web, analytics, and geospatial applications.
By using a streaming platform with a PostgreSQL database, you can essentially leverage the power of the two to better access, store, and utilize data for your systems and apps. But with the internet awash with tutorials on the subject, which method do you pick?
In this article, we’ll break down one of the simplest methods to connect Kafka to PostgreSQL without losing the benefits of either platform… and without wasting time.
Introduction To Kafka And PostgreSQL
Apache Kafka is an open-source event broker that can stream a large volume of messages using a distributed architecture. Messages are organized into topics, and each topic is split into partitions that can be replicated across brokers for fault tolerance.
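To make the topic and partition idea concrete, here is a minimal in-memory sketch in Python. It is not Kafka code: the `pick_partition` helper, the three-partition topic, and the sample events are all hypothetical, and the CRC32 hash is only a stand-in for whatever partitioner a real Kafka client uses. The point it illustrates is that messages with the same key land in the same partition, so their relative order is preserved.

```python
import zlib


def pick_partition(key: str, num_partitions: int) -> int:
    """Map a message key to a partition index.

    Illustrative only: CRC32 is a stand-in hash, not Kafka's partitioner.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# A hypothetical topic with three partitions, modeled as in-memory lists.
partitions = [[] for _ in range(3)]

# Hypothetical keyed events: (key, value).
events = [("user-1", "login"), ("user-2", "login"), ("user-1", "purchase")]

for key, value in events:
    partitions[pick_partition(key, len(partitions))].append((key, value))

# Show which events ended up in which partition.
for index, log in enumerate(partitions):
    print(index, log)
```

Because the hash is deterministic, both `user-1` events are appended to the same list in the order they were produced.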
As a real-time streaming platform, Kafka allows engineers to build data pipelines and pretty much any other type of data-streaming application imaginable.
It’s an open-ended framework, making it versatile but also challenging to use in a standalone manner.
Features Of Kafka
Here are some key features of Kafka:
- Support for exactly-once processing semantics (when idempotent producers and transactions are enabled) for better data accuracy.
- Kafka is scalable with minimal downtime thanks to its distributed architecture made of nodes called brokers.
- A variety of free and paid tools exist to enable real-time data transformation with Kafka, including ksqlDB and Kafka Streams.
PostgreSQL is an enterprise-level, open-source relational database management system that is widely used for its reliability and performance. It supports standard SQL for querying relational data and JSON data types for storing and querying non-relational data.
Postgres is highly scalable and performant, while also enabling many advanced data types and optimization processes. Commercial databases like SQL Server and Oracle support just a few of these data types.
Features of PostgreSQL
Some of the key features of PostgreSQL are listed below:
- PostgreSQL supports complex SQL queries, including joins, subqueries, and window functions.
- The platform supports client-server network architecture.
- It is also fault-tolerant thanks to its write-ahead-logging (WAL) feature.
- Postgres supports geographic information systems (GIS) data and location-based services.
Benefits Of Connecting Kafka To PostgreSQL
As discussed above, Apache Kafka has proven benefits in managing large volumes of real-time data. It is durable and fault-tolerant. PostgreSQL, for its part, is an object-relational database that provides a familiar data structure and querying capabilities while still being highly scalable.
While Kafka is a powerful tool for handling large volumes of streaming data, it can be challenging to work with due to its complex architecture and the need to write custom code to process data streams. Additionally, the data stored in Kafka topics is often not easily accessible or useful until it is transformed and loaded into another system.
This is where integrating Kafka with a database like PostgreSQL can be valuable. You can use Kafka to ingest and stream data, and then use PostgreSQL as a foundation for building apps and websites that use that data. This allows you to leverage the strengths of both technologies: Kafka for handling high volumes of streaming data and PostgreSQL for storing and querying structured data.
Let’s now take a detailed look at the use cases for connecting Kafka to PostgreSQL.
1. Business Use Cases For This Integration
- Enable real-time alerts for market changes.
- Analyze data as it is generated in real-time.
- Run real-time machine learning (ML) inference.
- Be faster than competitors when responding to events.
2. Technical Use Cases For This Integration
There are various technical benefits of connecting Kafka to PostgreSQL too. These include:
2.1 Real-Time Views in Postgres
While Postgres supports materialized views, they are snapshots that must be refreshed explicitly. Keeping one continuously up to date requires a real-time data integration.
By sourcing data from Kafka, you can set up a continuous materialized view in Postgres that displays an always-accurate summary of important data.
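The difference between a refreshed snapshot and a continuously maintained summary can be sketched in a few lines of Python. This is an in-memory illustration only, not how Flow or Postgres implement anything: the `RunningOrderTotals` class, the event fields `customer_id` and `amount_cents`, and the sample events are hypothetical. It shows the core idea of updating an aggregate once per incoming event instead of recomputing it on a schedule.

```python
from collections import defaultdict


class RunningOrderTotals:
    """Keeps a per-customer total current as each event arrives."""

    def __init__(self) -> None:
        self.totals = defaultdict(int)

    def apply(self, event: dict) -> None:
        # Incremental update: only the affected row of the summary changes.
        self.totals[event["customer_id"]] += event["amount_cents"]


# Hypothetical events, standing in for messages read from a Kafka topic.
events = [
    {"customer_id": "c-1", "amount_cents": 1250},
    {"customer_id": "c-2", "amount_cents": 300},
    {"customer_id": "c-1", "amount_cents": 750},
]

view = RunningOrderTotals()
for event in events:
    view.apply(event)
    # After every event the summary is already up to date.
    print(dict(view.totals))
```

In the real integration, the summary would live in a Postgres table that is kept current as new events arrive, rather than in a Python dictionary.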
2.2 Real-Time Data In Your Postgres-Driven Apps
The majority of businesses today strive to become event-driven. This integration allows you to easily upgrade Postgres-driven applications to support real-time data — no need to change the database you use.
Tutorial: Connecting Kafka To PostgreSQL
There’s a variety of methods to connect Kafka to PostgreSQL — some easier than others. In this article, we’ll cover one of the simplest methods: using Estuary Flow.
Flow is a DataOps data integration platform from Estuary that offers real-time, scalable, and configurable data integration capabilities. Like Kafka, Flow is an event-based, real-time system. It offers a low-code method to bridge the gap between Kafka and Postgres.
With a variety of open-source connectors, Flow’s centralized platform can create real-time data pipelines that link numerous source and destination data systems. Unlike a standard ETL platform, Flow includes an event-driven runtime, which enables true real-time data pipelines.
Estuary provides a built-in Kafka connector to help database users migrate their data from Kafka to PostgreSQL. Here’s how you can get started:
To complete this workflow, you’ll need:
- An Estuary Flow trial or organizational account. Go to the Flow web app to sign up for a trial, or, if you want to start creating production data pipelines, contact Estuary to get your organizational account.
- A Kafka cluster with:
  - bootstrap.servers configured so that Flow can connect via the desired host and port.
  - An authentication mechanism and connection security set up (highly recommended for production environments).
- A PostgreSQL database with:
  - Credentials for a user with write access.
  - The database host and port on hand.
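Before opening the web app, it can help to confirm that the host and port you plan to enter actually accept connections. The following is a minimal sketch using only Python's standard library. The `parse_address` and `is_reachable` helpers are hypothetical names, and the addresses are placeholders that assume the conventional default ports (9092 for Kafka, 5432 for PostgreSQL). Note that this only checks reachability from the machine running the script. It does not prove that Flow itself can reach the same endpoints, and it does not test authentication.

```python
import socket


def parse_address(address: str) -> tuple[str, int]:
    """Split a 'host:port' string into its parts, or raise ValueError."""
    host, _, port = address.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {address!r}")
    return host, int(port)


def is_reachable(address: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    host, port = parse_address(address)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Placeholder addresses: replace with your own broker and database.
    endpoints = {"kafka": "localhost:9092", "postgres": "localhost:5432"}
    for name, address in endpoints.items():
        print(name, address, is_reachable(address))
```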
Capture from Kafka
You’ll first capture data from Kafka. This process will create one or more data collections in Flow, which you’ll then materialize to Postgres.
- Go to the Flow web application at dashboard.estuary.dev and sign in using the credentials provided by your Estuary account manager.
- Click the Captures tab and choose New Capture.
- Choose the Kafka tile.
A form appears with the properties required for the Kafka connector.
- Type a name for your capture.
Click inside the Name field to generate a drop-down menu of available prefixes, and select your prefix (for trial accounts, your prefix will be “trial/”). Append a unique capture name after the / to create the full name, for example, trial/myKafkaCapture.
- Supply Bootstrap Servers, plus TLS connection settings and Authentication if you’re using them. Click Next.
Flow uses the provided information to initiate a connection to Kafka. It identifies the topics in your cluster and maps each to a Flow collection.
The Collection Selector appears, showing this list of possible collections.
- Look over the list in the Collection Selector and remove any collections from Kafka topics you don’t want to capture.
- Scroll down to the Specification Editor.
Here, you can view the generated capture definition and the schema for each collection.
Flow displays these specifications as JSON.
- For each collection, modify its JSON schema in the Specification Editor to accurately describe the shape of your Kafka data. The schema will be essential to how the data is mapped to Postgres tables later. (Because Kafka is an unopinionated message broker, you need to supply the schemas manually — Flow isn’t able to infer schemas automatically as easily as it can with some other systems.)
- Once you’re satisfied with the configuration, click Save and Publish. You’ll see a notification when the capture publishes successfully.
- Click Materialize Collections to continue.
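To illustrate the schema step above, here is a rough sketch of what a JSON schema for one topic's messages might look like, written as a Python dictionary, along with a tiny checker. Everything specific here is an assumption: the `order_id`, `customer_id`, and `amount_cents` fields describe a made-up orders topic, and `find_problems` is a hypothetical helper that handles only the `required` keyword and two `type` values. It is not a full JSON Schema validator, and it does not show the rest of a Flow collection specification.

```python
import json

# Hypothetical schema for messages on an "orders" topic.
order_schema = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "customer_id": {"type": "string"},
        "amount_cents": {"type": "integer"},
    },
    "required": ["order_id", "customer_id", "amount_cents"],
}

# Only the two JSON types used above are handled in this sketch.
_PYTHON_TYPES = {"string": str, "integer": int}


def find_problems(document: dict, schema: dict) -> list[str]:
    """Return human-readable mismatches between a document and the schema."""
    problems = []
    for field in schema.get("required", []):
        if field not in document:
            problems.append(f"missing required field: {field}")
    for field, rule in schema.get("properties", {}).items():
        if field not in document:
            continue
        value = document[field]
        expected = _PYTHON_TYPES[rule["type"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"{field} should be {rule['type']}")
    return problems


# A sample message whose amount was serialized as a string by mistake.
sample = json.loads(
    '{"order_id": "o-1", "customer_id": "c-9", "amount_cents": "1250"}'
)
print(find_problems(sample, order_schema))
```

Running a few real messages from each topic through a check like this before editing the schema in the Specification Editor can surface mismatches early, since the schema determines how the data is mapped to Postgres tables.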
Materialize to PostgreSQL
Now that you’ve captured data from Kafka topics into one or more Flow collections, you can materialize them to Postgres tables.
- Select the PostgreSQL tile.
The page populates with the properties required for Postgres. Details of the collections you just created are already filled out in the Collection Selector.
- Choose a unique name for your materialization like you did when naming your capture; for example, trial/myPGMaterialization.
- Fill out the Address in the format host:port, User, and Password. Click Next.
Flow initiates a connection with the database and maps each Flow collection from your capture to a new table that it will create on your behalf.
- Click Save and Publish. You’ll see a notification when the full Data Flow publishes successfully.
From now on, new data that streams through your Kafka topics will be materialized to the appropriate Postgres table in milliseconds.
Why Use Estuary?
Estuary as a data integration platform has many benefits when connecting Kafka to PostgreSQL. For one, the platform offers a short process to set up real-time pipelines in its GUI.
In the tutorial, we used Flow’s schema validation capabilities to describe the shape of the data. Flow also allows you to re-shape data with native real-time transformations.
And Estuary Flow isn’t just limited to Kafka and PostgreSQL either. With a huge library of connectors to choose from, including databases and SaaS apps, you can create a wide variety of real-time pipelines.
Connecting Kafka to PostgreSQL has obvious benefits for businesses and users alike. But doing so can be a difficult and time-consuming task even for the most seasoned professionals.
While there are numerous ways to build this connection, today’s tutorial walked you through one of the easiest, which can be set up in minimal time.