Real-time and batch are two broad categories of data processing. Though they handle data differently, both are vital to the systems that make our businesses and society run.
At Estuary, we talk a lot about unifying both methods into a single data pipeline. But why does that matter? To understand better, let’s break down real-time and batch: how each method works, their benefits and drawbacks, and sample use-cases.
What is batch data processing?
Batch processing involves collecting data over a period of time and processing it all at once.
When you picture a traditional data analysis workflow — one where you ask questions of previously compiled data — you’re probably thinking of batch. Indeed, analytics and business intelligence are two areas where batch processing has been critical for decades, and remains central to this day.
Batch jobs are usually scheduled on a recurring basis. For example:
- A commercial business collects sales data, analyzes it daily in comparison to historical data, and produces reports that display trends and variables of interest.
- An electric company keeps track of usage and bills customers monthly.
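The pattern behind both examples can be sketched in a few lines. Here’s a minimal, hypothetical batch job in Python: records accumulate throughout the day, and a scheduler (cron, Airflow, or similar) runs the aggregation once at the end of the period. The product names and amounts are invented for illustration; amounts are in cents to keep the arithmetic exact.

```python
from collections import defaultdict
from datetime import date

# Hypothetical day's worth of collected sales records:
# (product, amount_in_cents) tuples, stored as they arrive but not yet processed.
sales = [
    ("widget", 1999),
    ("gadget", 450),
    ("widget", 1999),
]

def run_daily_batch(records):
    """Process the whole day's records in one pass -- the scheduled batch job."""
    totals = defaultdict(int)
    for product, amount_cents in records:
        totals[product] += amount_cents
    return dict(totals)

# A scheduler would invoke this once per day, well after collection.
report = run_daily_batch(sales)
print(date.today(), report)
```

Note how collection and processing are fully decoupled: nothing about the list of records requires the job to run at any particular moment, which is exactly what makes batch simple to operate.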
Batch data benefits and when to use it
The main advantage of batch is its simplicity. Compute resources become easier to manage, and because collection and processing are distinct steps, the processing job can run offline on an as-needed basis.
The obvious downside to batch is the time delay between data collection and the availability of results. However, there are plenty of use-cases in which immediate results are unnecessary or even counter-productive. Let’s use our previous examples to illustrate this:
- For the business collecting sales data, seeing the figures for the 9 AM hour on a given day is unlikely to change what happens at 10 AM. What is useful is a holistic look at sales over time, taking other variables into account, so managers can learn to allocate resources and plan business operations accordingly. Each individual data point is of little use on its own, so there’s no need to rush to process it.
- For the electric company, sending a bill to its customers every time they turn the lights on and off would result in an outrageous volume of correspondence, frustrated customers, and many more opportunities for the system to break.
In other cases, however, it’s absolutely critical to reduce or eliminate the time lag for data. That’s where real-time processing comes in.
What is real-time data processing?
Real-time processing is a method of processing data continuously as it is collected, within a matter of seconds or milliseconds. Real-time systems react quickly to new information in an event-based architecture.
While it can be used for analytics, the best examples of real-time data are systems that couldn’t operate without instantaneous processing, for example:
- Rideshare services rely on location data, traffic data, and supply and demand data to instantly match a driver and rider, set prices, and calculate times.
- Banks need to apply machine learning algorithms to transaction data in real time to immediately detect and prevent fraud.
- Online stores must instantaneously manage inventory to avoid conflicts (such as selling the same item twice), monitor stock, and provide alerts.
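To contrast with the batch pattern, here’s a minimal sketch of an event-based processor in Python: each event is acted on the moment it arrives rather than being stored for later. The fraud check is a toy stand-in (a simple amount threshold), and the transaction fields are invented for illustration; a production system would use a streaming platform rather than an in-process queue.

```python
import queue
import threading

events = queue.Queue()  # stand-in for an incoming event stream
alerts = []

def is_suspicious(txn):
    # Toy stand-in for a real fraud model: flag unusually large amounts.
    return txn["amount"] > 10_000

def stream_processor():
    """React to each event as it arrives, not on a schedule."""
    while True:
        txn = events.get()   # blocks until the next event arrives
        if txn is None:      # sentinel value: shut the processor down
            break
        if is_suspicious(txn):
            alerts.append(txn)  # respond immediately, per event

worker = threading.Thread(target=stream_processor)
worker.start()

# Events trickle in one at a time and are handled on arrival.
for txn in [{"id": 1, "amount": 42}, {"id": 2, "amount": 50_000}]:
    events.put(txn)
events.put(None)
worker.join()
print(alerts)
```

The key difference from the batch sketch is structural: there is no "end of the day" when processing happens. The processor must always be running and ready, which is also why real-time systems are harder to keep healthy.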
Real-time data benefits and when to use it
As we can see, the benefits of real-time streaming and data analysis include the power to interrogate our data and automate responses to logistical challenges the moment they arise. Without it, many of the modern technologies we take for granted wouldn’t be possible.
However, real-time systems require continuous input, processing, and output, and thus must always be kept running and in working order. What’s more, they’re notoriously challenging to implement. Not only must events be streamed; they must also be transformed in real time. Another major pain point is the fact that most real-time systems don’t know how to handle historical data — anything other than what’s happening right now.
Many organizations that don’t truly need millisecond latency data wonder if it’s necessary to go to all the trouble. However, new advances in data pipelines and managed data infrastructure services can mitigate these challenges.
The takeaway: flexible timing matters
The key to success with your data architecture is to consider the timeliness you need from your data. Most organizations fall somewhere between the examples we’ve discussed so far: they don’t necessarily need data within milliseconds, but daily is too long of an interval.
Consider your exact use-case. How fresh does your data need to be? Can it be minutes or hours old? Maybe you require data from one particular source in real time, but you’re OK with latency in others.
Fortunately, solutions exist for all of these use-cases.
Many ETL vendors, like Fivetran and Stitch, are batch tools, but they’re a far cry from an old-school batch process that crunches numbers on on-prem servers during downtime. These tools offer flexibility and reduce latency significantly, though not to a matter of seconds.
All of the above are tools that focus on data loading — moving it between systems. The real customization comes when you choose transformation engines, storage systems, and other components to build a modern data stack.
Although there’s such a high degree of customizability, there’s still a rift between modern batch and real-time workflows.
Estuary is bridging that gap by building a unified pipeline, Flow. Flow streams data as soon as it receives it from the source — it can stream data within milliseconds, or wait for a source with higher latency. It will never add latency, always adapts to the scale of the source system, and knows how to use historical data. A central tenet is openness, so you can connect to all systems, even SaaS tools that are usually restricted to batch pipelines.
Check out these resources to learn more: