When you choose the platforms and systems that comprise your modern data infrastructure, the pricing model is a major factor to consider.
Your data warehouse and data pipelines handle a ton of data every day, and they can quickly make or break your budget. Evaluating exactly how and why you’ll be charged is a crucial planning step.
The pricing conversation arguably doesn’t get the attention it deserves. Instead, the rhetoric often fixates on a system’s performance, shiny new capabilities, and the “right” way to use it in your data stack. (Data vendors are particularly guilty of this redirection.)
At the end of the day, your data’s job is to drive your business’s bottom line. And if you have to fork over hundreds of thousands — or even millions — of dollars to vendors each year to get the data outcomes you need?
Well, you might not actually be meeting your business goals at all.
In this article, we’ll look at the popularity of opaque pricing models for data warehouse and data pipeline platforms, and how that can be problematic. Along the way, we’ll break down other aspects of data platform pricing: compute-based, storage-based, and the variations therein.
By understanding the pricing models popular in the market today, you’ll be able to choose your vendor more wisely. Even if you end up choosing a product that has a less-than-stellar model, you’ll know what to expect and be better equipped to negotiate your contract.
Understanding the market through data warehouse pricing
Data warehouse pricing models are discussed more than data pipeline pricing models, but the two are closely related.
Often, the main function of the data pipeline is to move data into the warehouse. So, the two systems will be purchased by similar customers and handle similar quantities of data.
The target customer? Mostly large enterprises, plus some small and medium businesses.
The quantity of data? Massive. On the terabyte scale, and growing quickly.
That’s the key. The target market of a data warehouse (and connected technologies) is a company that already has tons of data, and will continue to acquire and use more and more data as time goes on.
The 2021 SODA Data and Storage Trends report surveyed enterprises on their total data growth. It found:
“Mainstream annual data growth is between 1-100 TBs per year, as reported by 62% of the sample. However, 9% of the sample is seeing annual data growth of 1PB or more. This is ten to 100 times greater growth than the mainstream and is likely a harbinger of where many enterprises will find themselves within a few years.”
Of course, this is total data, not data stored in the warehouse, but it’s indicative of the overall trend in the industry. The survey also found that 46% of respondents are running workloads on data warehouses “all the time.”
Data warehouses are the cornerstone of the Modern Data Stack, and studies also show that data warehouse sales are booming. Companies across industries need data warehouses to support functions like advanced analytics, low-latency views, and operational analytics — which are quickly becoming non-optional.
Vendors are well aware of this. They also know that once you migrate to a product, you get locked in — you rely on that product and build your infrastructure around it. Their potential profits in the long run depend on how they structure their pricing models. As your total data volume grows, and the number of data sources and data tools you use increases, your warehouse and pipeline bills may increase more than you expect.
But once you’re locked in, you’re more likely to tolerate a growing bill than migrate to a different platform.
Compute vs volume-based data warehouse pricing models
With this in mind, let’s look at some common data warehouse pricing models.
There are typically two components to your data warehouse bill: storage and compute.
Storage pricing is pretty straightforward: it’s undeniably a matter of data volume. You can measure the number of bytes of data stored, and charge on that. That’s to be expected, so we’ll set this topic aside.
Charging for warehouse compute, on the other hand, is where things get interesting.
Data warehouses store data in a way that’s designed for analytical query performance. You specifically put it there so you can query it, and these queries have a cost.
The warehouse vendor can charge you for compute in two main ways:
- On the volume of data scanned by running the query.
- On the actual compute used to run the query.
BigQuery: Charging by scanned data volume
Google BigQuery charges by terabytes read during query processing. This is a straightforward, predictable model used by many cloud providers.
But charging for queries by data volume creates a dilemma for Google. If an engineering team at Google found a way to make queries scan less data (usually the goal of any querying technology), it could actually be bad for the company’s bottom line.
In fact, some users have noted interesting behaviors when querying in BigQuery that cause it to scan entire datasets unnecessarily, but which are suspiciously not resolved in production.
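To make scan-based billing concrete, here’s a minimal Python sketch. The rate is a made-up placeholder (not BigQuery’s actual price), and `scan_cost` is a hypothetical helper of our own. The point is simply that the bill tracks bytes read, not how hard the engine worked.

```python
# Hypothetical rate for illustration only; check the vendor's current price list.
PRICE_PER_TB_SCANNED = 5.00  # assumed USD per terabyte scanned

BYTES_PER_TB = 10**12  # decimal terabyte; a vendor may bill in binary TiB instead


def scan_cost(bytes_scanned: int, price_per_tb: float = PRICE_PER_TB_SCANNED) -> float:
    """Return the cost of one query billed purely on the bytes it scans."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb


# The same business question, answered by two differently shaped queries
full_scan = scan_cost(2 * 10**12)    # query reads an entire 2 TB table
pruned_scan = scan_cost(50 * 10**9)  # query reads only a 50 GB slice of it

print(f"full scan:   ${full_scan:.2f}")
print(f"pruned scan: ${pruned_scan:.2f}")
```

Under these assumed numbers, the full scan costs 40 times more than the pruned one. That gap is why anything that triggers an unnecessary full scan matters so much under this model.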
Snowflake: Charging by compute… by “credits”
One could argue that it makes more sense to charge based on the actual compute resources the vendor uses to run your query. The vendor, after all, is paying for the compute resources that you consume, so tying the price to them is arguably fairer, and it doesn’t give the vendor an incentive to make operations less efficient.
The important thing is that the company be transparent about quantifying what these compute resources actually are.
To be fair, measuring compute is a little more complex than measuring data volume. It comprises many factors, including processing power, networking, and memory. But this doesn’t relieve a vendor of the responsibility of documenting the technical details of its pricing structure.
Let’s look at an example. Data warehouse vendor Snowflake has seen tremendous revenue growth in recent years. A particularly interesting metric is its net revenue retention: as of a Q1 2022 estimate, Snowflake’s rate was 174%, meaning existing customers alone spent 74% more than they had a year earlier, far surpassing its peers.
Of course, much of this has to do with the quality of the product and the aforementioned growth in enterprise data volume. Still, statistics like these raise eyebrows, and Snowflake has come under some scrutiny for its pricing.
Snowflake charges for compute through a unit called a “credit.” Credits are kind of like Monopoly money. They aren’t directly tied to anything quantifiable in the real world, such as the hardware used. This lack of transparency raises some red flags. How can a customer truly know if they’re paying a fair price for what Snowflake is doing under the hood? (They can’t.)
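Here’s a rough sketch of how a credit-based bill comes together. Every number and name in it is an assumption for illustration: the size-to-credit table, the price per credit, and the `credit_bill` helper are all hypothetical, not Snowflake’s published rates.

```python
# All values below are assumptions for illustration, not any vendor's real rates.
CREDITS_PER_HOUR = {"small": 2, "medium": 4, "large": 8}  # hypothetical burn rate per size
PRICE_PER_CREDIT = 3.00  # hypothetical USD per credit, usually set by contract


def credit_bill(size: str, hours: float, price_per_credit: float = PRICE_PER_CREDIT) -> float:
    """Convert running time to credits, then credits to dollars."""
    credits_used = CREDITS_PER_HOUR[size] * hours  # conversion 1: time -> credits
    return credits_used * price_per_credit         # conversion 2: credits -> dollars


# A "medium" warehouse that runs for 120 hours in a month
monthly = credit_bill("medium", hours=120)
print(f"monthly compute bill: ${monthly:,.2f}")
```

Notice what never appears in the calculation: CPU cores, memory, or network. The customer sees only two conversions (size to credits, credits to dollars), and the vendor sets both.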
Now that we’ve looked at simple volume-based pricing as well as credit-based pricing for data warehouses, let’s apply these principles to data pipeline pricing.
Row-based pricing for data pipelines
Unlike with warehouses, compute isn’t really relevant to data pipeline pricing. Just about all vendors charge based on data volume. But even though data volume is a relatively straightforward metric, many pricing structures introduce complications, much as Snowflake’s credit model does.
Instead of charging on pure volume, many vendors charge by row. This makes some sense: the number of rows of data read by a data pipeline is quite easy to predict and estimate, and most batch pipeline tools think of data in terms of rows.
Popular vendors including Fivetran, Airbyte, and Stitch Data use variations of row-based pricing.
With row-based pricing, you can safely predict a linear relationship between the amount of data you ingest and the price of running the pipeline. But you may not be able to predict the details of that relationship.
That’s because row-based pricing is a proxy for volume. A few issues with row-based pricing for data pipelines include:
- Not all rows are the same size.
- For smaller data sources (like SaaS apps) the overall data volume is typically quite small; using rows gives the vendor wiggle room to charge more for each small integration.
- Data source systems model data differently. Some aren’t row-based; the way the data pipeline re-shapes this data into rows introduces another layer of complexity.
Rows look like a simple way to charge on data volume, but they are less transparent than they seem.
On top of that, many vendors add another layer of abstraction to this pricing model, often in the form of (you guessed it!) credits.
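The first issue, row size, is easy to see with a little arithmetic. This is a minimal sketch with made-up numbers: the price per million rows and the `effective_price_per_gb` helper are hypothetical, not any vendor’s actual rate.

```python
BYTES_PER_GB = 10**9  # decimal gigabyte


def effective_price_per_gb(price_per_million_rows: float, avg_row_bytes: float) -> float:
    """Translate a per-row price into what you effectively pay per GB moved."""
    rows_per_gb = BYTES_PER_GB / avg_row_bytes
    return rows_per_gb / 10**6 * price_per_million_rows


# Same assumed row price, two sources with very different row widths
narrow = effective_price_per_gb(10.00, avg_row_bytes=100)    # e.g. small event records
wide = effective_price_per_gb(10.00, avg_row_bytes=10_000)   # e.g. wide denormalized rows

print(f"narrow rows: ${narrow:.2f} per GB")
print(f"wide rows:   ${wide:.2f} per GB")
```

With the same assumed row price, the narrow-row source works out to 100 times the per-GB cost of the wide-row source. The row price alone tells you little until you know how your data is shaped.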
Volume-based pricing for data pipelines
When any data vendor charges on a proxy, such as rows or credits, you lose some amount of agency as a consumer. This doesn’t necessarily mean you’ll be taken advantage of, but it does mean that it’s your responsibility to inquire with your sales representative to determine exactly what you’ll be paying for.
You should feel free to negotiate, especially if you represent a large account with a huge volume of data.
The ideal situation, though, is to find a data pipeline platform that charges on pure data volume.
When you’re charged on pure volume:
- You won’t overpay for adding many smaller data systems to your pipelines.
- You can more easily predict what you’ll pay, no matter how your source data is modeled.
- You won’t have to be quite as hyper-vigilant to ensure you’re not being overcharged.
At this point you might be wondering: what is a fair price per unit of data volume in a data pipeline?
That’s a hard question to answer: it depends on the market and will change constantly over time.
It also depends on the vendor’s profit margins, which, in turn, depend on whether the pipeline architecture they provide is efficient and highly scalable.
In an ideal world, your vendor is likely looking to reduce its own costs of operation. And if it does so, it can transfer those savings to you while still turning a profit, and gain more customers by being known for its reasonable pricing.
In any case, your vendor should absolutely offer high-volume discounts that will take effect as your data volume grows.
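As a sketch of what a high-volume discount can look like, here is a graduated (tiered) per-GB price in Python. The tier boundaries, the rates, and the `monthly_cost` helper are all hypothetical; real vendors structure discounts in many different ways.

```python
# Hypothetical graduated tiers: (upper bound in GB, USD per GB). None means "no upper bound".
TIERS = [(1_000, 1.00), (10_000, 0.75), (None, 0.50)]


def monthly_cost(gb: float, tiers=TIERS) -> float:
    """Bill each slice of volume at its own tier's rate and sum the slices."""
    cost = 0.0
    lower = 0.0  # lower bound of the current tier
    for upper, rate in tiers:
        if upper is None or gb <= upper:
            cost += (gb - lower) * rate  # the last, partially filled tier
            break
        cost += (upper - lower) * rate   # a completely filled tier
        lower = upper
    return cost


for volume in (500, 5_000, 10_000, 20_000):
    print(f"{volume:>6,} GB -> ${monthly_cost(volume):,.2f}")
```

With these assumed tiers, doubling from 10,000 GB to 20,000 GB raises the bill from $7,750 to $12,750, well short of double. That is the shape to look for: the unit price falls as your volume grows.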
How to evaluate the price of big data tools
When shopping for data warehouses and data pipeline tools, your best bet is to get quotes from multiple vendors and compare.
As you do so, here are four things to consider:
As we showed in our discussion of data warehouse pricing, almost any pricing model can encourage the vendor to turn a blind eye to certain inefficiencies in the system.
Of course, the goal should be to do business with companies with super-solid products that you trust, but you should always stay on your toes.
Ultimately, a B2B customer-vendor relationship can be positive and mutually beneficial. But this requires careful negotiation, honesty, and attention to detail on both sides.
On that note:
Avoid opaque pricing models
Your data bill is guaranteed to grow over time, because your data will grow over time. Create a best-case scenario by only agreeing to pricing models that are transparent and scalable. Avoid models with added layers of complexity (like credits or rows) that can obscure what you’re actually paying for.
And as time goes on, continue negotiating…
Ask for high-volume deals.
They are out there, and the best ones are probably not advertised.
Finally, if you take just one thing away from this blog post, it should be this:
While evaluating pricing, think realistically about your data’s future.
Your organization probably has a lot of data now. It is highly unlikely that it will ever have less. So, ask yourself: does the pricing model scale in a way that is beneficial to you, or to the vendor?
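One way to answer that question is to project your volume forward and price it under each model you’re considering. The sketch below does this for two made-up models. The growth rate, prices, threshold, and function names are all assumptions chosen for illustration.

```python
def project_volume(start_gb: float, annual_growth: float, years: int) -> list[float]:
    """Monthly data volume at the start of each year, assuming compound growth."""
    return [start_gb * (1 + annual_growth) ** year for year in range(years + 1)]


def flat_bill(gb: float, price_per_gb: float = 1.00) -> float:
    """Model A (hypothetical): one flat price per GB, whatever the volume."""
    return gb * price_per_gb


def discounted_bill(gb: float, threshold_gb: float = 2_000,
                    base_price: float = 1.00, discount_price: float = 0.50) -> float:
    """Model B (hypothetical): every GB above the threshold is billed at a lower rate."""
    if gb <= threshold_gb:
        return gb * base_price
    return threshold_gb * base_price + (gb - threshold_gb) * discount_price


# Assumed scenario: 1,000 GB per month today, growing 50% per year, over three years
for year, gb in enumerate(project_volume(1_000, 0.50, 3)):
    print(f"year {year}: {gb:,.0f} GB | flat ${flat_bill(gb):,.2f} | "
          f"discounted ${discounted_bill(gb):,.2f}")
```

In this made-up scenario, the two models cost the same until volume crosses the threshold, and then they diverge: by year three the flat model costs roughly a quarter more. Running your own numbers through a vendor’s actual price sheet shows which side the model favors as you grow.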
Flow is a data integration platform charged per GB of data flow on a monthly basis. You can learn more about our pricing model or try it for free.