September 29, 2021

The costs of data integration explained, and how to minimize them

There is a cost associated with putting your data to work, and the benefits you gain depend on the systems you put into place. To maximize net value, you need to strike a balance between minimizing costs and maximizing gain.

Olivia Iannone Technical Writer

In our last post, we touched on the ideas of cost and value as they relate to a big-name event-streaming platform. Today, we’ll take that discussion further and apply it to your enterprise data integration strategy more broadly.

Data is a valuable commodity; you may have heard it said that data is the new oil. The comparison may sound gimmicky, but it holds up: like oil, data must be extracted, processed, transported, and put to effective use. The upshot? There is a cost associated with putting your data to work, and the benefits you gain depend on the systems you put into place.

To maximize your net value, you need to strike a balance between minimizing costs and maximizing gain.

What is data integration?

Data integration is the process of combining and connecting data from various sources. Every enterprise has what’s referred to as a data stack: a collection of systems for data extraction, analysis, and storage. Proper data integration unifies all of these systems, giving you a single source of truth and, ideally, allowing data to flow freely between systems.

At Estuary, we specialize in data integration by providing tools and services for flexible, unified data pipelines. Data integration is one part of the wider world of data engineering: a quickly growing field that is critical for enterprise success, but also rife with challenges. While the themes we’ll discuss here could just as easily be applied to data analysis, storage, and collection, we think data integration is a great place to focus your efforts when it comes to cost and value optimization. That’s because it’s such a critical and all-encompassing aspect of your data stack.

Costs of data integration

We spend a lot of time touting the value we can gain from effective data integration: data-driven business insights, agility and adaptability in real time, and the list goes on.

But we may not spend as much time thinking critically about its costs.

Like anything in your business, the costs of data integration can’t be thought of simply in terms of bills that you pay. The time and effort that you and your team members put in also have a direct financial implication: that time and effort could be spent on other valuable projects, and, of course, there’s compensation to consider.

The exact accounting of all this (employees' time, bills, resources) depends on how your organization is set up. But we can break down the costs into a few straightforward categories.

Infrastructure and tooling

This is the upfront cost of setting up your data integration solution. Here are some of the components you’ll need:

Physical or cloud-based resources to run the system. Though separate from your data storage, you’ll still need a data warehouse or data lake to handle intermediate data as it moves through your pipeline.
The pipeline or streaming platform itself. The solution can be built completely in-house, purchased as a service from a vendor, assembled from open-source components, or some combination of these. This can be an ETL tool or a differently structured but related system.
Connectors. These are components that speak a common language and provide an interface between the centralized pipeline and your other systems.

You’ll also need someone to do the work to get things up and running, which we’ll discuss more below.

Generally speaking, the solutions with the lowest up-front price tag will involve the largest amount of effort from your team. On the other hand, you can hire a data integration consultancy or vendor to build an entire data pipeline for you.

Maintenance and upgrades

Setting up your data infrastructure solution is just the beginning. You’ll incur costs on an ongoing basis throughout its lifecycle. These include:

Licenses and storage fees. You’ll need to regularly pay to maintain your storage as well as the licenses for any SaaS or vendor services you’re continuing to use.
Scaling infrastructure. As time goes on, you’ll accumulate more and more data. You may find that you need to increase the capabilities of your pipeline to handle more data flow. Your external storage systems are also impacted by the amount of data coming through your pipeline and may need to be expanded as well.
New connectors and solutions. As you adopt new systems throughout your data stack, you’ll need to do work to integrate them into your existing pipeline. This involves careful work, including testing and monitoring, to avoid breaking the existing pipeline.

Personnel

Unfortunately, most data infrastructure solutions are quite difficult to implement and maintain. Unless you’re relying entirely on an outside consultancy or vendor for help, you’ll likely need to hire at least one data engineer.

Demand for data engineers currently far exceeds supply, and their salaries often run well into six figures.

Mistakes and false starts

If you spend time building a data integration solution, and then it breaks, doesn’t meet your needs, or doesn’t scale, this can amount to a waste of resources. You’ll have to start over and spend that time, money, and effort again.

Of course, every budget and business plan must include some allowance for losses. Mistakes and inefficiencies are inevitable to a certain extent, but they can be minimized with the right planning and choices. That’s what we’ll get into next.
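Before moving on, it can help to see how the four categories above add up. The following is a minimal sketch, not a real accounting model: the `IntegrationCosts` class, its field names, and every dollar figure are hypothetical placeholders chosen only to illustrate the arithmetic. It treats mistakes and false starts as a flat percentage allowance on top of everything else, which is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class IntegrationCosts:
    """Cost estimate for one candidate solution (all figures hypothetical)."""
    infrastructure_setup: float     # one-time: resources, platform, connectors
    yearly_maintenance: float       # licenses, storage fees, scaling, new connectors
    engineer_count: float           # full-time-equivalent data engineers
    engineer_yearly_cost: float     # compensation plus overhead per engineer
    rework_allowance: float = 0.10  # fraction reserved for mistakes and false starts

    def total(self, years: int) -> float:
        """Total cost over `years`, including the rework allowance."""
        personnel = self.engineer_count * self.engineer_yearly_cost
        recurring = self.yearly_maintenance + personnel
        base = self.infrastructure_setup + recurring * years
        return base * (1 + self.rework_allowance)


# Two made-up candidates: cheap tooling with heavy staffing
# versus a pricier managed service that needs less engineering time.
diy = IntegrationCosts(20_000, 15_000, 2.0, 150_000)
managed = IntegrationCosts(60_000, 80_000, 0.5, 150_000)

for name, plan in [("diy", diy), ("managed", managed)]:
    print(f"{name}: {plan.total(years=3):,.0f} over 3 years")
```

With these invented numbers, personnel dominates the total, which mirrors the point that the lowest up-front price tag tends to demand the most effort from your team. Your own figures may tell a different story.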

How to minimize data integration costs

Make a realistic plan and strategy. To succeed in data management, you must begin by treating it as a business problem. Gartner Research makes that case quite well in their article on cost optimization for data management (it also contains a bunch of techniques that make it a worthwhile read).

Define your data goals and devise a strategy to meet them. Research the actual costs involved, and be realistic about the labor involved; it’s not uncommon to realize deep into a project that you don’t have enough data engineers on staff to handle it.

Prioritize data quality. The accuracy, completeness, orderliness, and timeliness of your data should always be a top priority. Make sure the solutions you choose do not jeopardize your data!

Data quality is directly related to business value. In 2020, businesses surveyed by Gartner lost, on average, $12.8 million to poor data quality.

Choose adaptable, scalable systems. When choosing a data integration method, don’t just consider the current state of your organization. Also leave room for the possibility of growth and change. Can the solution handle much larger volumes of data than you have currently? Is it relatively easy to connect to a variety of data systems, even those that you don’t currently plan to use?

If the answer is “no” to these questions, it might be worth investing a bit more up front to find a solution that will have fewer growing pains in the future.
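One rough way to reason about "investing a bit more up front" is a break-even calculation. The sketch below is illustrative only: the function names, the cost structure (an up-front fee, a fixed monthly fee, and a per-gigabyte charge), and all numbers are assumptions, and real pricing is rarely this simple.

```python
from typing import Optional


def cumulative_cost(upfront: float, monthly_base: float, cost_per_gb: float,
                    start_gb: float, monthly_growth: float, months: int) -> float:
    """Cumulative cost after `months`, with data volume compounding each month."""
    total = upfront
    volume = start_gb
    for _ in range(months):
        total += monthly_base + cost_per_gb * volume
        volume *= 1 + monthly_growth
    return total


def break_even_month(cheap: dict, scalable: dict, horizon: int = 60) -> Optional[int]:
    """First month in which the scalable option is cheaper overall, or None."""
    for month in range(1, horizon + 1):
        if cumulative_cost(months=month, **scalable) < cumulative_cost(months=month, **cheap):
            return month
    return None


# Hypothetical options: no up-front cost but expensive per GB,
# versus a larger up-front investment with cheaper scaling.
cheap_now = dict(upfront=0, monthly_base=500, cost_per_gb=10,
                 start_gb=100, monthly_growth=0.05)
built_to_scale = dict(upfront=20_000, monthly_base=1_000, cost_per_gb=4,
                      start_gb=100, monthly_growth=0.05)

print("Break-even month:", break_even_month(cheap_now, built_to_scale))
```

If the function returns `None`, the more expensive option never pays for itself within the chosen horizon, which is itself useful information when weighing the two.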

Choose a unified data integration solution. Using a multitude of licensed tools, storage solutions, and services is a common way for enterprises to rack up huge recurring bills. To reduce this, de-fragment your data infrastructure as much as possible.

Wherever it makes sense, choose open-source, cheap, and free systems. But when you’re looking for the central “backbone” that will tie everything together, prioritize unification over all else. This will prevent you from purchasing more services and solutions to patch together your data infrastructure in the future.

Pricing models for common data integration tools

If you’ve read this far, you’re probably thinking: OK, that’s all well and good, but how much do these data integration tools and services you’re talking about actually cost?

Unless you are already a customer of a given vendor, it’s hard to predict what its data integration platform will actually cost. This is for a few reasons. First, each data infrastructure use case is unique, so it’s hard to succinctly explain pricing online. Second, this industry is changing rapidly, so what’s a fair price today may change in a few months.

All that being said, a smart approach to finding the best solution for your company could be summarized as follows:

Learn about the pricing models of the solutions you’re considering. This information is usually public and can shed light on what to expect. Use the table below as a guide.
Evaluate candidates to make sure they’ll deliver exactly what you need. Since the precise cost may be unknown up front, it’s vital to make sure you’re making a choice that will help you meet your goals and provide the maximum amount of value.
Once you’ve narrowed down the candidates, set up a consultation to discuss your use-case and the price you can expect. It’s best to narrow the field down before you reach this step to save yourself a lot of time on correspondence and phone calls.

Company or solution	Service offered	Pricing model	Comments
Snowflake	Cloud data warehouse and processing	Credit system: you purchase credits; credits are consumed based on time and resources used	Can provide a scalable data backbone, but an additional connector solution is required to integrate with external systems
BigQuery	Cloud data warehouse and processing	Analysis and storage priced separately. Flat-rate and on-demand models available.	Can provide a scalable data backbone, but an additional connector solution is required to integrate with external systems
Kafka	Open-source framework for real-time event streaming.	Free	Not a complete solution; extremely high engineering workload required to build a data pipeline.
Confluent	Provides managed implementations of Kafka	Priced in tiers; estimation tool available on website.	Additional price for each connector to external systems.
Fivetran	ETL platform and data pipelines.	Priced by monthly active rows.	Exclusively batch-based; limited speed
Airbyte	Open-source connectors. Many available and quickly growing.	Free	Exclusively batch-based; limited speed and scale. Must implement on your own.
Stitch	ETL platform with OSS extensibility	Standard pricing by rows, or custom pricing for enterprise cases.	Exclusively batch-based; limited speed
Estuary	Real-time data pipelines and open-source connectors	OSS offerings free; managed service billed by number of records.	See below
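To make the pricing models in the table more concrete, here is a minimal sketch of three of them: credit-based, volume-based (rows or records), and tiered. The function names and every rate are hypothetical and do not reflect any vendor's actual prices; the tiered function also assumes a flat price per tier, which is only one of several ways tiers can work.

```python
def credit_based_cost(hours_running: float, credits_per_hour: float,
                      price_per_credit: float) -> float:
    """Credit model: credits are consumed based on time and resources used."""
    return hours_running * credits_per_hour * price_per_credit


def volume_based_cost(units: int, price_per_million: float) -> float:
    """Volume model: billed by active rows or records processed."""
    return units / 1_000_000 * price_per_million


def tiered_cost(units: int, tiers: list) -> float:
    """Tiered model: pay the flat price of the smallest tier that covers usage.

    `tiers` is a list of (unit_limit, flat_price) pairs.
    """
    for limit, price in sorted(tiers):
        if units <= limit:
            return price
    raise ValueError("usage exceeds the largest tier; a custom quote would be needed")


# Made-up tiers and rates, compared across two monthly volumes.
example_tiers = [(10_000_000, 100), (100_000_000, 800)]

for monthly_rows in (5_000_000, 50_000_000):
    print(
        f"{monthly_rows:>11,} rows | "
        f"credits: {credit_based_cost(200, 2, 3):,.0f} | "
        f"volume: {volume_based_cost(monthly_rows, 10):,.0f} | "
        f"tiered: {tiered_cost(monthly_rows, example_tiers):,.0f}"
    )
```

Note that in this sketch the credit-based cost depends on compute time rather than row count, so it stays constant across the two volumes. In practice, larger volumes usually mean more compute time as well.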

Streamlined, affordable data integration with Estuary

At Estuary, we’re all too familiar with the challenges of building modern data pipelines while staying within budget. That’s why we’ve created our managed service, Estuary, and a growing ecosystem of open data connectors.

Estuary allows unification across all your systems in real time, is highly scalable, protects your data, and doesn’t require a whole team of engineers to stand up.

If you’re working on cost analysis for your organization’s data integration process, we’d love to hear from you.

You can also try the managed Estuary web application for free here.

About the author

Olivia Iannone, Technical Writer

Olivia Iannone is a technical writer who creates clear, accessible content on data engineering, real-time systems, and developer tools.