What is Apache Kafka?
Apache Kafka is a popular open-source event streaming platform that can be easily integrated with Estuary Flow. To understand Kafka (and event streaming in general), it helps to contrast it with how traditional databases handle data.
Databases store information in terms of objects and their states: descriptive information about each object at a moment in time. At large scale, this model can become cumbersome and slow. Kafka, by contrast, stores data as events. Distributed logs of events are easy to scale, which is how Kafka can handle trillions of events a day.
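The difference is easiest to see in code. The plain-Python sketch below (no Kafka APIs involved; all names are illustrative) contrasts a database that keeps only current state with an append-only event log, from which state can always be re-derived:

```python
# A traditional database keeps only the *current state* of each object;
# history is overwritten:
accounts_state = {"alice": 70}

# A Kafka-style log instead appends immutable *events* and never rewrites them:
event_log = [
    {"account": "alice", "type": "deposit", "amount": 100},
    {"account": "alice", "type": "withdrawal", "amount": 30},
]

def current_balance(log, account):
    """Derive the current state by replaying every event for the account."""
    balance = 0
    for event in log:
        if event["account"] == account:
            delta = event["amount"] if event["type"] == "deposit" else -event["amount"]
            balance += delta
    return balance

print(current_balance(event_log, "alice"))  # 70
```

Because an append-only log is never updated in place, it can be split across many machines and replayed independently, which is what makes the event model so easy to scale.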
Originally designed over ten years ago as a messaging queue, Kafka is now a fully fledged event streaming solution. Its low latency means it can move large volumes of data with minimal delay, and it's fairly easy to deploy. For these reasons, Kafka has become popular with large organizations that need to leverage data in real time through pipelines.
Still, there are some limitations to be aware of. Kafka excels at real-time processing, but integrating it with other aspects of your organization’s infrastructure can be complicated and clumsy. Configuring Kafka to handle both real-time and batch processes requires a lot of engineering resources. What’s more, the platform is now ten years old. Cloud-scalable solutions have taken off in popularity in recent years, and Kafka simply wasn’t designed for the cloud.
What is Apache Spark?
Apache Spark is a unified analytics engine designed for large-scale data processing. It's a fast, open-source engine that has been a key player in the rise of big data analytics and machine learning.
Spark processes big data through distributed processing: computations are spread across a cluster of machines, splitting the workload between them for faster results. Querying large datasets in Spark is also fast because it uses in-memory caching and optimized query execution.
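Conceptually, Spark partitions a dataset, computes a partial result per partition, and combines those partials. This is a single-machine sketch of that partition-and-aggregate pattern using only the Python standard library; it is not Spark code, and real Spark spreads the partitions across many machines rather than threads in one process:

```python
from concurrent.futures import ThreadPoolExecutor

def partition(data, n):
    """Split data into n roughly equal chunks, analogous to Spark partitions."""
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

def partial_sum(chunk):
    """Work done independently on one partition."""
    return sum(chunk)

def distributed_sum(data, workers=4):
    chunks = partition(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(partial_sum, chunks)  # "map" step, one task per partition
    return sum(partials)                          # "reduce" step combines partials

print(distributed_sum(range(1_000_000)))  # 499999500000
```

Keeping each partition's data in memory between such steps, instead of re-reading it from disk, is the caching trick that gives Spark its query speed.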
Despite its success, Spark has limitations. Unlike true event streaming, which processes each record as it arrives, Spark uses micro-batching: incoming data is grouped into small batches, which introduces a small but unavoidable delay (typically a fraction of a second). So although it comes close, Spark can't be considered truly real-time.
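To make the micro-batching idea concrete, here is a toy sketch of the mechanism: events are bucketed by the time window they arrive in, and each bucket is only processed once its window closes, which is where the latency comes from. The function name and the 500 ms interval are illustrative choices, not Spark defaults:

```python
def micro_batch(events, batch_interval_ms=500):
    """Group (timestamp_ms, payload) events into batches by arrival window."""
    batches = {}
    for timestamp_ms, payload in events:
        window = timestamp_ms // batch_interval_ms  # which interval the event falls in
        batches.setdefault(window, []).append(payload)
    # A batch is processed only after its window closes, so every event waits
    # up to batch_interval_ms before being handled: the micro-batching delay.
    return [batches[w] for w in sorted(batches)]

events = [(0, "a"), (120, "b"), (510, "c"), (990, "d"), (1500, "e")]
print(micro_batch(events))  # [['a', 'b'], ['c', 'd'], ['e']]
```

A true streaming system would hand each of the five events to downstream code the moment it arrived; here, "a" waits for its 500 ms window to close before it is seen at all.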
You'll see a much bigger lag in Spark when you first start a job: startup alone can take up to ten minutes, and some jobs run for hours. From an engineering standpoint, that means waiting minutes to find out whether your code even works, and hours to see whether it produces the output you want.
Spark is also a complex, low-level tool that requires lots of individualized code. This means companies usually need to hire specialist engineers to take advantage of it. Finally, Spark can be expensive because it requires a lot of RAM to create the caches that give it speed.
Combining Kafka and Spark to create a solution that streams and processes data may sound intuitive, but it's more complex than it might seem. You'd need to fund a team of engineers to build infrastructure connecting the two products. And while the resulting system would handle your real-time workflows, it wouldn't even begin to address batch analytics, an entirely separate but equally important set of data analysis use cases.
We launched Estuary because we’ve seen this problem time and time again. We believe that modern data pipelines can — and should — be better streamlined and easier to manage. Our unified data platform offers the best of real-time event processing and batch analytics while being accessible to both of their ecosystems, so you won’t have to replace all the infrastructure you’ve already built. It also gives instant feedback on code updates, which can save your team hours of waiting as you enhance your pipeline.
Our goal is to allow your engineering team to create and operate scalable data pipelines without spending valuable time and resources building the infrastructure that connects each component.