Let’s start by getting something important out of the way: there are some major differences between data engineering and software engineering.
At the same time, they’re similar enough that many of the best practices that originated for software engineering are extremely helpful for data engineering, as long as you frame them correctly.
In this article, I’ll walk through several software engineering best practices, and how they can help you create and maintain better data pipelines. I’ll be focusing on pipelines specifically because Estuary Flow will provide a handy example, but these principles apply just as well to your data stack at large.
The discussion will be high-level. I’m not a software engineer myself, and I don’t believe you have to be to gain strategic and leadership value from these principles.
Software engineering vs data engineering: similarities and differences
Data and software products are different, and their stakeholders are different.
Generally speaking, building a software product involves collaboration between highly technical teams. The product can be delivered to a huge variety of user groups, often commercially. For example, a bank might create a mobile application for its clients.
Data products, by contrast, tend to live within the confines of an enterprise. The stakeholders and players involved can range from highly technical engineers to non-technical professionals who need data to do their jobs. For example, that same bank might create financial and demographic data products about its clients to aid in security, sales, and strategy.
If you’re reading this, you hang out in the data space, which means I probably don’t need to belabor these distinctions. You’ve witnessed firsthand how data can be treated as if it’s very different from software — especially from the business perspective.
But the essential practices of data engineering and software engineering are basically the same. You’re writing, maintaining, and deploying code to solve a repeatable problem. Because of this, there are some valuable software engineering best practices that can be converted to data engineering best practices. Lots of the latest data trends — like data mesh and DataOps — apply software engineering practices in a new way, with excellent results.
The history of software engineering vs data engineering
To understand why these best practices come from software and are only recently being applied to data, we need to look at history.
The discipline of software engineering was first recognized in the 1960s. At the time, the idea that the act of creating software was a form of engineering was a provocative notion. In fact, the term “software engineering” was intentionally chosen to give people pause, and to encourage practitioners to apply scientific principles to their work. In the following decades, software engineers tested and refined principles from applied sciences and mechanical engineering.
(Check out this Princeton article for more details on all these bold-sounding claims.)
Then, in the 1990s, the industry fell behind a growing demand for software, leading to what was known as the “application development crisis.” The crisis encouraged software engineers to adopt agile development and related practices. This meant prioritizing a quick lifecycle, iterating, and placing value on the human systems behind the software.
On the other hand, data engineering as we know it is a relatively young field. Sure, data has existed for most of human history and relational databases were created in the 1970s. But until the 2000s, databases were solely under the purview of a small group of managers, typically in IT. Data infrastructure as an enterprise-wide resource with many components is a relatively new development (not to mention one that is still changing rapidly). And the job title “data engineer” originated in the 2010s.
In short, software engineers have had about 60 years of doing work that at least broadly resembles what they still do today. During that time, they’ve worked out a lot of the kinks. The data engineering world can use that to its advantage.
Without further ado, here are some software engineering best practices you can (and should) apply to data pipelines.
1 – Set a (short) lifecycle
The lifecycle of a product — software or data — is the cyclical process that encompasses planning, building, documenting, testing, deployment, and maintenance.
Agile software development puts a twist on this by shortening the development lifecycle, in order to meet demand while continuing to iterate and improve the product.
Likewise, you can — and should — implement a quick lifecycle for your data pipelines.
The need for new data products across your organization will arise quickly and often. Make sure you’re prepared by dialing in your lifecycle workflow.
- Plan with stakeholders to ensure your pipeline will deliver the required product
- Build the pipeline — For example, in Estuary Flow, you publish a specification.
- Document the pipeline — For example, in Flow, you’ll end up with a YAML catalog of your pipeline specification, and a descriptive JSON schema for each data collection.
- Test the pipeline before deploying — Pipeline tools like Flow or orchestration tools like Airflow make this possible.
- Deploy the pipeline.
- Monitor it — Watch for error alerts and make updates.
- Iterate quickly as use cases change — In Flow, you can edit and mix-and-match pipeline components.
The concept of integrating agile development methods into data is a huge component of the DataOps framework. Check out my full article on the subject.
2 – Pick the right level of abstraction
To keep your data lifecycle tight, it’s important not to get lost in the technical implementation details. This calls for abstraction.
Software engineers are quite comfortable with the concept of abstraction. Abstraction is the simplification of information into more general objects or systems. It can also be thought of as generalization or modeling.
In software engineering, the relevant levels of abstraction typically exist within the code itself. Functions and object-oriented languages, for example, are useful tools precisely because they hide the fine details of how the underlying code is executed.
In data, you’ll need to work with a level of abstraction that’s higher than code. There are two main reasons for this:
- The immediate connection between data products and the business use cases they serve means you’ll want to talk about data in more “real-world” terms. Getting clear on this level of abstraction means establishing a universal semantic layer — and helps avoid the common problem of multiple, conflicting semantic layers popping up in different BI tools and user groups.
- The wider variety of technical levels you’ll find in data stakeholders means that talking in terms of something highly technical, like code, isn’t very useful.
Again using the example of Flow, the abstractions you’ll want to focus on are the tasks in your pipeline. A capture is a task that ingests data from an outside source into a data collection. A materialization pushes that collection to an outside destination.
When we talk about pipelines in terms of tasks like captures and materializations, both engineers and business users are able to unite around the semantic value of the pipeline (it gets data from system X to system Y so that we can do Z).
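To make this concrete, here’s a rough sketch of what a Flow catalog specification with one capture, one collection, and one materialization could look like. Treat it as an illustration of the task-level abstraction, not a verbatim spec: the names, connector images, and schema below are placeholders I’ve invented for this example.

```yaml
# Illustrative Flow-style catalog: data moves from system X (Postgres)
# through a collection to system Y (a warehouse table).
captures:
  acmeCo/capture-users:
    endpoint:
      connector:
        image: ghcr.io/estuary/source-postgres:dev   # placeholder image
        config: source-config.yaml
    bindings:
      - resource: { table: users }
        target: acmeCo/users

collections:
  acmeCo/users:
    key: [/id]
    schema:
      type: object
      required: [id]
      properties:
        id: { type: integer }
        name: { type: string }

materializations:
  acmeCo/materialize-users:
    endpoint:
      connector:
        image: ghcr.io/estuary/materialize-postgres:dev   # placeholder image
        config: destination-config.yaml
    bindings:
      - source: acmeCo/users
        resource: { table: users_view }
```

Notice that nothing here describes *how* rows are read or written; each task only declares its source, destination, and data shape.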
3 – Create declarative data products
Ok, you caught me, this is really just a continuation of the discussion of abstraction, but it will give that discussion more substance.
Let’s consider the idea of data as a product. This is a central tenet of the popular data mesh framework.
Data-as-a-product is owned by different domains within the company: groups of people with different skills who share an operational use case for data. Data-as-a-product can be quickly transformed into deliverables that take many forms but are always use-case driven. In other words: they are about the what rather than the how.
The software engineering parallel to this is declarative programming. Declarative programming focuses on what the program can do. This is in contrast to imperative programming, which states exactly how tasks should be executed.
Declarative programming is an abstraction on top of imperative programming: when the program is ultimately compiled or run, something still has to settle on a how. But declarative programming defers that decision to the underlying engine, which allows more flexibility at runtime and can save resources. Plus, it’s easier to keep a grip on mentally, making it more approachable.
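The contrast is easiest to see in a few lines of code. This toy Python example computes the same result both ways:

```python
# Imperative: spell out exactly *how* to walk the list and accumulate a sum.
numbers = [1, 2, 3, 4, 5, 6]
total = 0
for n in numbers:
    if n % 2 == 0:
        total += n

# Declarative: state *what* we want (the sum of the even numbers)
# and let the runtime decide how to produce it.
declarative_total = sum(n for n in numbers if n % 2 == 0)

assert total == declarative_total == 12
```

The declarative version reads as a statement of intent, which is exactly the quality you want your pipeline definitions to have.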
By making your pipelines declarative — built based on their functionality first rather than their mechanism — you’ll be able to better support a data-as-a-product culture.
You’ll start with the product the pipeline is intended to deliver (say, a particular materialized view) and design the pipeline around it. A declarative approach to pipelining makes it harder to get lost in the technical details and forget the business value of your data.
4 – Safeguard against failure
Failure is inevitable, both in software development and data pipelines. It’s a lesson many of us have learned the hard way: scrambling to fix a catastrophically broken system, losing progress or data to an outage, or simply allowing a silly mistake to make it to production.
You can — and should — apply very similar preventative and backup measures in both software and data contexts.
Here are a few important considerations. Many of these functions can be fulfilled with a data orchestration tool, but I’ll discuss how each can be completed using Flow.
Testing

Testing should be part of your pipeline’s lifecycle, just as it is in software.
In Estuary Flow, there are two types of tests:
- The automatic testing of your configuration before deployment.
- Tests you customize to make sure data transformations are performing as expected.
As a rule of thumb, the more transformations a data pipeline applies, the more testing is required.
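As a sketch of what the second kind of test can look like, here’s a hypothetical contract-style test following Flow’s ingest-and-verify pattern: feed fixture documents into a source collection, then assert what a derived collection should contain. The collection names and documents below are invented for illustration.

```yaml
# Hypothetical Flow catalog test: ingest fixtures, then verify the
# output of a derivation that counts users per name.
tests:
  acmeCo/tests/users-by-name:
    - ingest:
        collection: acmeCo/users
        documents:
          - { id: 1, name: "Ada" }
          - { id: 2, name: "Ada" }
    - verify:
        collection: acmeCo/users-by-name
        documents:
          - { name: "Ada", count: 2 }
```

If a future change to the transformation breaks this expectation, the test fails before the pipeline ever reaches production.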
Version control

Software engineers use version control, usually Git, to collaborate on their work and retain the ability to roll back software to previous versions.
In Flow, inactive tasks are retained in the web app until you delete them. Flow also offers a complete GitOps workflow, meaning engineers can use Git to collaborate on pipelines in their preferred development environment.
Even if you can’t use Git for your data infrastructure, you should retain access to previous versions of a data pipeline.
Distributed storage and backfilling ability
The advent of cloud hosting and storage has lessened the danger of outages and data loss, but the danger hasn’t gone away. Your data infrastructure should be distributed; that is, different components should be spread across different servers, making the system fault-tolerant.
Flow is a distributed system that transcends cloud provider regions, so it’s resilient to outages. It’s also designed for large backfills of historical data, so when pipelines do fail, they can be recovered quickly.
One final lesson from software engineering best practices: when something doesn’t work, iterate.
The status quo and best practices are always in flux. This applies to software engineering and it definitely applies to data engineering.
The best approach is always one that’s well thought-out, introduces change safely, and includes buy-in from all stakeholders.
Start with principles like these and adapt them to fit your data team’s systems and culture. Note the positive effects and the areas that need improvement, and go from there.
To learn more about the concepts behind the many examples we sped through in this article, check out the Flow concepts documentation.