Talk to almost any data engineer who uses Apache Kafka, and they’ll have a lot to say. They’ll be able to list everything about the platform that frustrates them, but conclude with the fact that they love it.
What’s going on?
Apache Kafka is a cornerstone of modern data infrastructure. In many ways, it shines, but it comes with complications that impact some users more than others. Over the past decade or so, we’ve come to see Kafka as the only option of its kind, and we accept the idea that it will inevitably challenge us.
But as the data landscape rapidly broadens, we shouldn’t continue to take this for granted. Viable, user-friendly alternatives are becoming available. In the interest of saving time and effort, it’s worthwhile to re-evaluate whether a legacy system like Kafka is the best choice for your organization.
What is Apache Kafka and what does it do?
Apache Kafka is an industry-leading open-source event streaming platform. Started by LinkedIn employees in 2010, it was later donated to the Apache Software Foundation. Kafka is praised for being an extremely powerful platform that provides the data infrastructure foundation for thousands of companies, including big names like Netflix, Airbnb, and Twitter.
Kafka is used as an event bus. Applications produce events — or messages — into Kafka, which records them into an ordered message history. Other applications consume those messages in order and are notified in real time as new messages are produced. In contrast with a traditional database, which is well-suited for queries and updates against a current state, Kafka excels for applications that must quickly act on the various changes that lead to a current state.
Kafka is run on a cluster of fault-tolerant servers. Client systems called producers publish, or write, events to Kafka servers. Another group of clients called consumers subscribes to, or reads, those events. Messages are contained in topics, which can be further partitioned.
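To make the producer and consumer roles concrete, here is a minimal in-memory sketch of a single ordered message history. It is a toy model, not Kafka's client API: the `ToyTopic` class, its method names, and the event names are all hypothetical and exist only for illustration.

```python
from collections import defaultdict


class ToyTopic:
    """In-memory stand-in for one ordered message history (not Kafka's API)."""

    def __init__(self):
        self._log = []                    # append-only, ordered history
        self._offsets = defaultdict(int)  # next position to read, per consumer group

    def produce(self, message):
        """Append a message and return its position (offset) in the log."""
        self._log.append(message)
        return len(self._log) - 1

    def consume(self, group, max_messages=10):
        """Return the next unread messages for a group and advance its offset."""
        start = self._offsets[group]
        batch = self._log[start:start + max_messages]
        self._offsets[group] = start + len(batch)
        return batch


topic = ToyTopic()
for event in ["cart_created", "item_added", "checkout"]:
    topic.produce(event)

# Two hypothetical consumer groups read the same history independently.
billing = topic.consume("billing")
analytics_first = topic.consume("analytics", max_messages=1)
analytics_rest = topic.consume("analytics")
```

Each group keeps its own read position, so one slow consumer does not hold back another, and reading a message does not remove it from the log.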
Kafka provides an alternative to older messaging queues — such as RabbitMQ — in some significant ways, mainly:
- It can be easily scaled horizontally, by adding more nodes to the cluster and partitions to individual topics
- It can persist messages for a configurable period rather than deleting them as soon as they reach the consumer
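Both points can be illustrated with a small sketch. The assumptions here are loud ones: `partition_for` uses CRC32 as a stand-in hash (Kafka's own clients use a different hash function), and `apply_retention` filters individual messages by age, whereas real brokers work on log segments rather than single messages. The function names and sample events are hypothetical.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition index (CRC32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def apply_retention(log, now, retention_seconds):
    """Keep only (timestamp, message) entries younger than the retention window."""
    return [(ts, msg) for ts, msg in log if now - ts < retention_seconds]


NUM_PARTITIONS = 3
partitions = [[] for _ in range(NUM_PARTITIONS)]

# (timestamp in seconds, key, message); all values are made up.
events = [(0, "user-1", "login"), (50, "user-2", "login"), (100, "user-1", "logout")]
for ts, key, msg in events:
    partitions[partition_for(key, NUM_PARTITIONS)].append((ts, msg))

# Every message with the same key lands in the same partition, in order.
user1_partition = partitions[partition_for("user-1", NUM_PARTITIONS)]

# Messages are dropped by age, not because a consumer has read them.
recent = apply_retention([(0, "a"), (50, "b"), (100, "c")], now=100, retention_seconds=60)
```

Adding partitions spreads keys across more logs, which is what lets additional nodes share the load. The trade-off is that ordering is only preserved within a single partition.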
What is Kafka good for?
As we can see, Kafka was the first event-streaming platform of its kind, and now has over a decade of maturity and a huge user base. As the need for well-managed, low-latency data streams becomes more and more obvious, even the most traditional companies are taking note — and often turning to Kafka.
For huge enterprises that build big, highly customized data pipelines, like Netflix, Kafka can provide a backbone. Many other open-source projects are built on top of Kafka. If you have a team of engineers to dedicate to Kafka, or a lot of time on your hands, you can get great results in a wide variety of use cases.
In short: Kafka is fast, efficient, customizable, powerful, and, when managed well, very reliable.
What are the issues with Kafka?
As a business leader for a small to mid-sized organization, you might look at the breakthrough Kafka made in 2010 and the subsequent successes of the Netflix-es of the world and get excited. Could Kafka — open-source and well-maintained as it is — be the answer to siloed data and latency issues for all?
Kafka is inherently hard to implement and maintain, to the point where many organizations fail to meet their goals. As the backbone of your data infrastructure, Kafka is a large, complex system on its own. On top of that, it brings in additional complexities when integrating with client systems.
These characteristic challenges create a high barrier to entry and ongoing maintenance headaches. Let’s get into more detail.
Operating a Kafka deployment is a big and complex job. By nature, your Kafka deployment is pretty much guaranteed to be a large-scale project. Imagine operating an equally large-scale MySQL database that is used by multiple critical applications. You’d almost certainly need to hire a database administrator (or a whole team of them) to manage it. Kafka is no different. It’s a big, complex system that tends to be shared among multiple client applications. Of course it’s not easy to operate!
Kafka administrators must answer hard design questions from the get-go. This includes defining how messages are stored in partitioned topics, retention, and team or application quotas. We won’t get into detail here, but you can think of this task as designing a database schema, but with the added dimension of time, which multiplies the complexity. You need to consider what each message represents, how to ensure it will be consumed in the proper order, where and how to enact stateful transformations, and much more — all with extreme precision.
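A small sketch shows why ordering matters once a consumer keeps state. The account events and the overdraft rule below are hypothetical, chosen only to show that replaying the same messages in a different order can produce a different result.

```python
def apply_event(balance, event):
    """Fold one event into the running balance."""
    kind, amount = event
    if kind == "deposit":
        return balance + amount
    if kind == "withdraw" and balance >= amount:
        return balance - amount
    return balance  # a withdrawal that would overdraw is rejected


def replay(events, balance=0):
    """Derive the current state by applying events in the order given."""
    for event in events:
        balance = apply_event(balance, event)
    return balance


in_order = [("deposit", 100), ("withdraw", 80)]
out_of_order = list(reversed(in_order))

correct_balance = replay(in_order)      # the withdrawal succeeds after the deposit
wrong_balance = replay(out_of_order)    # the withdrawal is rejected, then the deposit lands
```

The two replays end in different states even though they contain identical messages, which is why decisions about keys, partitions, and where stateful logic runs have to be made up front.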
Kafka is a low-level tool. Like MySQL or any other database, Kafka is an open-ended tool that doesn’t provide one single, easy path to success. By itself, it does nothing to solve actual business problems.
In order to get any value out of Kafka, you need client applications that read and write data. Kafka itself doesn’t care how that’s done. It instead prioritizes flexibility to handle lots of different use cases. This is part of what makes Kafka so powerful, but in order to do anything useful with a Kafka cluster, your client applications need to bring their own opinions about how to use those flexible APIs.
You’ve probably heard the words “streaming is hard” before, and this is exactly why.
Kafka may be open source, but its inherent difficulty has led to a proliferation of companies that exist solely to manage it, including Confluent, which was founded by the engineering team that created Kafka. Kafka is simply not designed to be run by anything less than a team of specialized engineers.
Aside from being logistically difficult, Kafka has some other limitations that make it less than ideal for certain use cases.
- On-the-fly transformation is challenging. Kafka provides a consumer framework that can be used to power streaming applications within the runtime, but the nuts and bolts of it are up to you. ksqlDB or Confluent can help, but it’s not intuitive to do on your own.
- There’s a limitation to historical data. Although one of Kafka’s advantages is the option to persist events, scaling can become challenging and is constrained by available disk space, servers, and ongoing data migrations. Kafka was initially designed for on-premises storage, so deploying it on the cloud (where storage is almost a non-issue) involves additional legwork and ongoing compromise. Should a company use slower but more scalable network-attached persistent disks, or faster but ephemeral SSDs, which may require more data migrations?
- Kafka is not designed for batch data. Real-time data powers lots of newer workflows, but the batch paradigm is far from obsolete. Kafka doesn’t handle batch without a bit of hacking, and users must take care not to overload their brokers with large-scale batch reads, since those same brokers are responsible for recording new real-time data.
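To see how quickly retained history can press on disk space, here is a rough back-of-the-envelope sketch. The write rate, retention period, and replication factor are hypothetical numbers, and the estimate ignores compression, index files, and free-space headroom, so treat it as an order-of-magnitude guide rather than a sizing method.

```python
SECONDS_PER_DAY = 86_400


def retained_bytes(write_bytes_per_sec, retention_days, replication_factor):
    """Rough disk footprint: write rate x retention window x number of copies."""
    return write_bytes_per_sec * SECONDS_PER_DAY * retention_days * replication_factor


# Assumed workload: 5 MB/s of incoming events, kept for 7 days, stored as 3 copies.
estimate = retained_bytes(5_000_000, retention_days=7, replication_factor=3)
print(f"{estimate / 1e12:.3f} TB")
```

Under these assumptions, a fairly modest stream already needs about 9 TB across the cluster, and doubling the retention window doubles that figure.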
Evaluating Kafka with fresh eyes
Imagine that, after a lengthy process and numerous frustrating roadblocks, you’ve successfully deployed Kafka as the central element of your organization’s data pipeline. You’ve also figured out who’s going to manage different aspects of the system, and how to manage the people who make up that new team.
You feel quite accomplished — and rightfully so. You did something challenging and built a valuable system. By now, you and Kafka have been through a lot together, and you can reliably identify and placate most of its quirks.
But there are a couple of assumptions underpinning this scenario:
- That Kafka is the only solution to your data infrastructure problem
- That the value of your system is worth the large (time and effort) cost
We’ve accepted Kafka’s status quo because for years it provided the only solution of its kind. If getting a fast, robust, scalable streaming pipeline required some headaches, so be it. That was simply the price we had to pay.
And in a way, it made sense. We’re all conditioned to assume that something that provides a high degree of value must also come with a high cost. For Kafka deployment, that cost takes the form of time and effort — which, for a company, are basically the same thing as money.
For physical goods, manufacturing advancements and increased competition bring down costs over time. Where technology is concerned, the time and effort required for a given solution likewise diminish. When it comes to creating scalable, real-time data pipelines, it may be time to re-evaluate value and cost.
And in scenarios like the one above, perhaps we can redefine our idea of success: not as a victory over a system’s inherent complications, but as the choice of a solution that meets our needs with minimal time and effort.
Rethinking cost and value for data pipelines
Many technology systems, including those used for data infrastructure and analysis, are in a period of growth akin to the industrial revolution.
Our ability to create complex systems has improved dramatically, and our ideas about what data infrastructure can and should be have changed. Startup companies are powering much of this movement, creating new products and services. VC firm Andreessen Horowitz describes these trends in detail in this article.
At this point in time, it makes sense to ask more of your data infrastructure systems and consider new alternatives. For Kafka, these alternatives include Apache Pulsar, Redpanda, StreamSets, and even newer products like Estuary Flow.
You can define what makes a system valuable to your organization and use that to evaluate options. And you can now fairly expect systems to offer scalability and flexibility with less effort and confusion.
You’re free to evaluate if Kafka (or any other big-name platform) is right for you. So ask yourself some questions, such as:
- What matters more to you: maturity and user community, or ease of use and new value adds?
- Is Kafka’s specific architecture well-aligned with your business goals to the point where logistical challenges are worth it?
- Or are you simply looking for a flexible, reliable, real-time data pipeline — not necessarily Kafka?
- If you already have a Kafka deployment, could some of the time and effort your team currently spends managing it be spent on other projects?
At Estuary, we’re proud to be part of the startup movement that’s making big changes to data infrastructure. We believe that real-time data pipelines that are easy to deploy and use for organizations of all sizes can become the norm.
That’s the vision behind our growing product, Flow. You can learn more about it on a high level on its product page, in technical detail in our docs, or take a look at the code.