Data pipelines are exactly what they sound like: a type of technology that moves data between a business’s many different databases, tools, and other systems.
Real-time data pipelines and batch data pipelines move and process data quite differently. Though both types of data pipelines are vital infrastructure that keep organizations running, they have different advantages and implications — and can sometimes be misunderstood.
In this article, we’ll break down how batch and real-time data processing work, and what this means for your data pipelines. We’ll look at the pros and cons of each, some simple use cases, and what they might mean for your data.
What is a batch data pipeline?
Batch data pipelines use batch processing, meaning they collect data over an interval of time and process it at a scheduled interval.
When you picture a traditional data analysis workflow — one where you ask questions of previously compiled data — you’re probably thinking of batch. Indeed, analytics and business intelligence are two areas where batch processing has been critical for decades, and remains central to this day.
Batch processing jobs are usually scheduled on a recurring basis. For example:
- A commercial business collects sales data, analyzes it daily in comparison to historical data, and produces reports that display trends and variables of interest.
- An electric company keeps track of usage and bills customers monthly.
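The pattern behind both examples can be sketched in a few lines: records accumulate over an interval, and a single scheduled job processes them all at once. This is a minimal, hypothetical sketch (the record fields and function name are illustrative, and the in-memory list stands in for a real database or file drop):

```python
from collections import defaultdict
from datetime import date

# Hypothetical sales records accumulated over time; in a real pipeline
# these would be read from a database or file drop, not an in-memory list.
sales = [
    {"day": date(2023, 1, 1), "store": "A", "amount": 120.0},
    {"day": date(2023, 1, 1), "store": "B", "amount": 75.5},
    {"day": date(2023, 1, 2), "store": "A", "amount": 90.0},
]

def daily_batch_report(records):
    """Process an accumulated batch of records in one scheduled run."""
    totals = defaultdict(float)
    for r in records:
        totals[(r["day"], r["store"])] += r["amount"]
    return dict(totals)

# A scheduler (cron, Airflow, etc.) would invoke this once per day;
# between runs, new records simply pile up unprocessed.
report = daily_batch_report(sales)
```

The key property is the separation of collection from processing: nothing happens to an individual record until the scheduled run fires.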
Batch data benefits and when to use it
The main advantage of batch processing is its familiarity and simplicity. Batch pipelines are relatively easy to build in-house, and many vendors sell batch pipeline services that are simple to deploy. If you have smaller or on-premises data infrastructure, it may be important to carefully manage compute resources. Because data collection and processing are distinct steps in a batch data pipeline, the processing job can be run offline on an as-needed basis.
The obvious downside to batch data pipelines is the time delay between data collection and the availability of results. However, there are plenty of use cases in which immediate results are unnecessary or even counterproductive. Let’s use our previous examples to illustrate this:
- For the business collecting sales data, seeing that data for the 9 AM hour on a given day is unlikely to change what will happen at 10 AM. What is useful is a holistic look at sales over time, taking into account other variables, so the managers can learn to allocate resources and plan business operations accordingly. Each single data point is useless on its own, so there’s no need to rush to process it.
- For the electric company, sending customers a bill every time they turn the lights on and off would result in an outrageous amount of correspondence, frustrated customers, and many more opportunities for the system to break.
It’s also possible to make batch processing intervals quite short: down to minutes in some cases. While instant data sounds appealing, many organizations find that a small delay is a worthwhile trade-off for the simplicity of working with batch.
Even then, the price tag on batch data pipelines can become problematic, especially when pulling from large database tables. That’s because all the source data must be queried each time the job runs.
What is a real-time data pipeline?
Real-time data pipelines use real-time data processing, also known as event streaming. It’s a method of processing data continuously as it is collected, within a matter of seconds or milliseconds. Real-time systems react quickly to new information in an event-based architecture.
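In contrast to a scheduled batch job, an event-based system runs a handler the moment each event arrives. This toy sketch shows the shape of that loop; the event fields and names (`handle_event`, `positions`) are illustrative, not a real streaming API:

```python
# Latest known location per driver, updated continuously.
positions = {}

def handle_event(event):
    """React immediately to a single new data event."""
    if event["type"] == "location_update":
        positions[event["driver_id"]] = event["coords"]

# Events stream in one at a time; in production this loop would consume
# from a broker such as Kafka rather than iterate an in-memory list.
stream = [
    {"type": "location_update", "driver_id": "d1", "coords": (40.7, -74.0)},
    {"type": "location_update", "driver_id": "d1", "coords": (40.8, -73.9)},
]
for event in stream:
    handle_event(event)
```

There is no waiting interval: state is current as of the last event, which is what makes the operational use cases below possible.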
While real-time data pipelines can be used for analytics, they are absolutely necessary for systems that must operationalize data instantaneously, for example:
- Rideshare services rely on location data, traffic data, and supply and demand data to instantly match a driver and rider, set prices, and calculate times.
- Banks need to apply machine learning algorithms to real-time transaction data to detect and prevent fraud immediately.
- Online stores must instantaneously manage inventory to avoid conflicts (such as selling the same item twice), monitor stock, customize a shopper’s experiences, and provide alerts.
Real-time data benefits and when to use it
As we can see, the benefits of real-time streaming and data analysis include the power to act on data and automate time-sensitive decisions the moment events occur. Without it, many of the modern technologies we take for granted wouldn’t be possible.
Real-time data pipelines can also greatly reduce costs, especially for enterprises with huge datasets. Whereas batch data pipelines must repeatedly query the source data (which may be massive) to see what has changed, real-time pipelines are aware of the previous state and only react to new data events. That means much less processing overall.
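The cost difference comes down to what each approach touches per run. A rough sketch of the two models, with illustrative data and names:

```python
source_table = [10, 20, 30]  # all rows in the source system

def batch_total(table):
    """Batch: re-read and re-aggregate the whole table every run."""
    return sum(table)  # cost grows with total table size

class StreamingTotal:
    """Streaming: remember the previous state, apply only new events."""
    def __init__(self):
        self.total = 0

    def on_event(self, value):
        self.total += value  # cost is per new event, not per table size

stream_state = StreamingTotal()
for row in source_table:
    stream_state.on_event(row)

# A new row arrives: the batch job must rescan all four rows,
# while the streaming job touches only the one new value.
source_table.append(40)
batch_result = batch_total(source_table)
stream_state.on_event(40)
```

Both arrive at the same answer, but the batch path repeats work proportional to the table size on every run, while the streaming path does work proportional only to what changed.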
However, implementing real-time data pipelines is complex from a data engineering perspective. Open-source streaming solutions can help, but they aren’t nearly as easy to use as most batch pipeline tools on the market. Another major pain point is that many existing real-time systems can’t handle historical data: anything other than what’s happening right now.
Real-time technology is changing
When it comes time to choose between a real-time or batch data pipeline, two of your focus areas should be:
- Timeliness: what are the temporal needs for your data? Can it be minutes or hours old, or must it be available in milliseconds? Do you need to backfill historical data?
- Cost: are you looking to cut costs in your data infrastructure overall? This is most significant if your organization has large data resources.
Fortunately, the data pipeline tools on the market are quickly becoming more sophisticated and more diverse. You may not have to make as many sacrifices as you think.
Batch-based ETL vendors like Fivetran are appealing due to their ease of use and out-of-the-box integration with a wide variety of source systems. They offer flexibility and reduce latency significantly, though not to a matter of seconds.
Those who require real-time data often turn to Kafka or the Kafka-based service Confluent. Kafka is the industry standard for event streaming, but it doesn’t handle batch or historical data well. Plus, Kafka itself is extremely challenging to set up.
At Estuary, we believe you shouldn’t have to sacrifice time or money for convenience or simplicity, so we’ve made it our mission to bridge the gap.
Estuary Flow combines the ease of a batch platform with the power and cost savings of real-time. Flow is built on a streaming backbone, but can still connect to all systems, even SaaS tools that are usually restricted to batch pipelines. Its UI-forward web application is designed to democratize access to real-time data.
The Flow web app is available in beta. Get a free trial.