Every modern business is a data-driven business.

The amount of data in existence is growing exponentially, and it has become almost impossible to stay competitive in any industry without putting this valuable resource to use.

Well-managed data makes every process more efficient, from marketing to manufacturing. Of course, you’ve heard this sort of buzz before. Data is the lifeblood of a modern business — or, to use another analogy, it’s like the water in your home.

What is a data pipeline?

The water in the water main under your street is useless to you until it comes out of your tap. If it doesn’t pass through the heater on the way to your shower, you’ll probably be pretty upset. And if a pipe bursts in your wall, you’ll stop whatever else you’re doing to deal with it.

The term “data pipeline” evokes a mental image that speaks for itself. Like water, data is rarely stagnant. It must travel between disparate systems, and often must be cleaned or transformed along the way. You don’t want your data to lag, get stuck, or arrive at its destination in an unusable condition.

But what exactly is a data pipeline? We can define it as a technological pathway that captures data from a source and delivers it to a destination in a desired state.

This doesn’t “just happen”; it takes well-designed infrastructure. As with most well-designed infrastructure (pipes, roads, the electric grid), the mark of a successful data pipeline is that it’s easy for end users to forget about.

Of course, this is a simplistic explanation. Let’s take a moment to appreciate how complex real-world use cases can get.

Organizations need to manage data throughout its lifecycle. They must:

  • Capture data from many external sources.
  • Transform the raw data into actionable information through analysis and visualization.
  • Store data in a secure, accessible manner so it can power continuous insights now as well as historical analysis in the future.
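To make the three steps concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the sample records, the function names (capture, transform, store), and the in-memory list that stands in for a real storage system. A production pipeline would read from live sources and write to a database or warehouse.

```python
import json
from typing import Iterable, Iterator

# Hypothetical raw records, standing in for data captured from an external source.
RAW_EVENTS = [
    '{"user": "ana", "amount": "19.99"}',
    '{"user": "ben", "amount": "5.00"}',
]


def capture(lines: Iterable[str]) -> Iterator[dict]:
    """Capture: parse each raw JSON line into a record."""
    for line in lines:
        yield json.loads(line)


def transform(records: Iterable[dict]) -> Iterator[dict]:
    """Transform: convert the amount string into integer cents."""
    for record in records:
        cents = round(float(record["amount"]) * 100)
        yield {"user": record["user"], "amount_cents": cents}


def store(records: Iterable[dict], destination: list) -> None:
    """Store: append records to a list that stands in for real storage."""
    destination.extend(records)


warehouse: list = []
store(transform(capture(RAW_EVENTS)), warehouse)
```

Because each step is a generator, records flow through one at a time instead of being held in memory all at once, which is one simple way to keep data from getting "stuck" between stages.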

In reality, these three simple buckets break down into a huge variety of workflows. You might need to capture data with millisecond latency (making it a real-time data pipeline), or you might prefer to check for new data at intervals (batch processing). Captured data often comes from dozens of sources, in dozens of formats. For example, customer data scraped from your website will look different from readings from a physical sensor.

(For a complete intro to real-time vs batch, check out this post.)
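The difference between the two styles can be sketched in a few lines of Python. This is a simplified illustration, not a production design: the function names are hypothetical, and the batch version groups events by count, whereas real batch jobs usually run on a time interval.

```python
from typing import Callable, Iterable, List


def process_streaming(events: Iterable[int], handle: Callable[[int], None]) -> None:
    """Real-time style: handle each event as soon as it arrives."""
    for event in events:
        handle(event)


def process_batch(
    events: Iterable[int],
    batch_size: int,
    handle: Callable[[List[int]], None],
) -> None:
    """Batch style: accumulate events and handle them in groups."""
    batch: List[int] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            handle(batch)
            batch = []  # start a fresh list so handled batches are not mutated
    if batch:
        handle(batch)  # flush the final, partially filled batch


streamed: List[int] = []
batched: List[List[int]] = []
process_streaming(range(5), streamed.append)
process_batch(range(5), 2, batched.append)
```

Here the streaming handler is called five times, once per event, while the batch handler is called three times: twice with full batches and once with the leftover event.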

Data transformations range from minor reformatting to big-data analysis. The data you transform might be captured directly from the external source, be output from a previous transformation, or come from cloud storage. All the while, the most up-to-date data must be continuously synced across all your storage to ensure accurate, current record-keeping.

Data pipelines vs ETL

ETL, which stands for extract, transform, load, is an industry shorthand for a specific type of data pipeline. You may have heard the term “ETL pipeline.” Sometimes, the two terms are used interchangeably, although they’re not always the same; not all data pipelines are ETL. 

You’ll notice that we already covered all these functions of a data pipeline in the previous section; we just used different words: “capture” for “extract” and “store” for “load.” The word choice doesn’t matter nearly as much as the order.

ETL pipelines transform data before they load (or store) it to the final endpoint. This has a major implication: it introduces a time lag. Thus, ETL pipelines are traditionally a form of batch processing, not real-time. 

Of course, the categorization of ETL is a bit more complicated than that. The same is true of its counterpart, ELT. We have a separate post where we break down the finer details of the paradigm, which you can check out here.
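As a rough illustration of why the order matters, the Python sketch below runs the same hypothetical transformation in both orders. The clean function, the sample records, and the lists standing in for a destination are all assumptions made for this example.

```python
def clean(record: dict) -> dict:
    """Hypothetical transformation: normalize the email field."""
    return {**record, "email": record["email"].strip().lower()}


raw = [{"email": "  Ana@Example.COM "}, {"email": "ben@example.com"}]

# ETL: transform first, then load. The destination only ever holds cleaned data.
etl_destination = [clean(record) for record in raw]

# ELT: load the raw data first, then transform it inside the destination later.
elt_destination = list(raw)                           # load as-is
elt_view = [clean(record) for record in elt_destination]  # transform after loading
```

Both orders end with the same cleaned records. The difference is what the destination holds in the meantime: with ETL, nothing arrives until the transformation finishes, while with ELT the raw data is available immediately and the original values are still there if you need them.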

Modern data infrastructure and the data pipeline

The data technology space is evolving extremely quickly. When it comes to data pipelines, we’re at an interesting juncture.

We have the technology to build unified, streamlined data pipelines that hold all of a company’s data systems together. In reality, though, we often fall short of that goal. 

Many organizations have multiple high-maintenance, disjointed pipelines. This forces engineers to constantly work to catch up with their own data, new pieces being added to their data stack, and ever-evolving industry standards.

It’s worth asking: how did we get to this point? The short answer: rapid industry growth, and the associated growing pains.

Much of the change that’s occurred in the history of data infrastructure has happened rapidly in the past two decades. It wasn’t long ago that all data was stored in highly structured databases on physical servers. Unlike today’s cloud, these were limited in computing power and storage. While you couldn’t do nearly as much with your data, the systems were easier to manage. 

Then, at a relatively rapid pace, technologies emerged that would allow organizations to scale both the amount of data they could store and the ways in which they could process it. Open-source offerings proliferated. Unique data stacks combining SaaS products, analysis tools, and storage solutions became the norm.

The advantage to this, of course, is that organizations could tailor their systems to suit their needs. But connecting these different components with functional data pipelines became challenging. With the pace of change today, it’s been hard for teams to find their footing.

That’s why it’s still common for businesses to be stuck with a spiderweb of confusing data pipelines that barely manage to support their systems. They end up hiring data engineers to build these custom solutions and fix things when they break, but the problems never end.

This expensive task is made more challenging by the shortage of data engineers in the job market. More importantly, with a good pipeline, it shouldn’t be necessary. 

What data pipelines do

Once implemented, a good data pipeline should do the heavy lifting without creating mountains of work for your engineering teams. This allows your team members to focus on what’s important — the success of your business — and not on managing infrastructure. 

High-quality, unified data pipelines have the following functions:

  • Ingest structured and unstructured data instantaneously from all sources
  • Transform, prepare, and enrich data as needed
  • Materialize captured or transformed data into storage
  • Handle both batch and real-time workflows
  • Intelligently adapt to network lags or schema changes, and alert you to significant issues
  • Flexibly allow future customization without breaking
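One common way to adapt to network lag and raise an alert when things go wrong is to retry with exponential backoff. The Python sketch below shows the general idea only. The with_retries helper, its parameters, and the flaky_source function are hypothetical and do not describe how any particular product works.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def with_retries(
    fetch: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    alert: Callable[[str], None] = print,
) -> T:
    """Call fetch, retrying on ConnectionError with exponentially growing delays."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except ConnectionError as error:
            if attempt == attempts:
                # Out of retries: alert a human and re-raise the error.
                alert(f"giving up after {attempts} attempts: {error}")
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


# Simulated source that fails twice before succeeding.
calls = {"count": 0}


def flaky_source() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network lag")
    return "payload"


# Record the delays instead of actually sleeping, to keep the example fast.
delays: List[float] = []
result = with_retries(flaky_source, sleep=delays.append)
```

The waits double after each failure (0.5 seconds, then 1 second), so a brief network hiccup is absorbed quietly, and only a persistent failure triggers the alert.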

How to build (or re-build) a data pipeline 

Still not sure if your organization needs a better data pipeline? Check if any of the following sound familiar.

  • You create, use, and store large volumes of data 
  • Data-driven insights are key for your organization’s success, but you sometimes struggle to get them in time
  • You use multiple storage and analytical systems that are currently siloed
  • Your current data infrastructure seems more like an IT liability than an asset

Some organizations successfully build and maintain in-house data pipelines. They tend to be companies with tons of data and huge resource pools for managing it — think Spotify and Netflix. For these companies, hiring whole teams of engineers makes sense and works within their budget. 

For most organizations, the in-house method isn’t optimal for reasons we’ve already discussed: it’s costly, pulls team members’ attention away from their actual jobs, and can force additional hiring.

To return to our metaphor: it’s like maintaining an old building with old plumbing and wiring that’s not designed for modern life. 

But what if there were a relatively easy way to replace the plumbing? This can be done by working with new data technologies on the market.

The last few years have seen a new wave of data infrastructure startups that were built to alleviate the growing pains of the data boom. Offerings range from open source connectors (think of these as pipeline components that link two specific systems) to custom pipeline deployments. 

Estuary’s DataOps platform, Flow, gives business stakeholders and data engineers a common platform to collaborate on customizable real-time pipelines. It’s built on a scalable streaming broker and uses an ecosystem of open-source connectors.


Flow is in private beta, and public beta is coming in 2022.
