Or, how to take an organization-first approach to modern data architecture
The conversation about data mesh is perhaps as decentralized and hard to pin down as the concept itself. Articles (like this one!) have proliferated across online platforms, many of which center on one of three themes:
- What a data mesh is, and why it’s great
- What a data mesh is, and why it’s terrible
- How to implement a data mesh
It’s easy to get whiplash reading articles like these, especially the last variety, which seem like they should be directly useful.
The issue? A lot of what we read suggests that we can only reap the benefits of data mesh by radically restructuring our entire organization. Realistically, that’s not something most of us can just do.
A data mesh is characterized by self-serve data products intended for a variety of users. Applying that same principle, we should treat the data mesh concept as an adaptable, mutable “take what you need and leave the rest” paradigm.
The point is: there is a lot to learn from the idea of data mesh without re-building your company from the ground up. This has a lot to do with organizational thinking around how we manage teams and humans more generally. Let’s dive in.
Wait, what is a data mesh?
If you’ve missed all the discourse around data mesh for the past year and change (an admirable feat in and of itself), I’d like to capitalize on this unique opportunity to give you the quick lowdown.
First of all, here’s what a data mesh is not:
- A data mesh is not a concrete tool or product, like a database, warehouse, or analytics engine. It has more to do with data architecture and governance.
- A data mesh is also not a specific set of architecture or governance rules. There is no definitive checklist that you can go through to see if something is or is not a data mesh. Rather, data mesh is a conceptual framework.
If your reaction to this is to feel confused and hungry for concrete technical details, that’s probably normal. But as frustrating (and counterintuitive) as it may be, it’s ultimately helpful to think in high-level terms here, at least at first.
Now, the idea of the data mesh was pioneered by a technology consultant named Zhamak Dehghani and outlined in her 2019 article titled “How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh.” You should read this article when you have time, but I’ll summarize it quickly here.
In brief, Dehghani introduces the need for decentralized data ownership and distributed data across an enterprise, united by central self-serve infrastructure. The resulting data products are made accessible across functional teams by adhering to high-quality standards, governance, and metadata.
She breaks this down mainly by contrasting it with the data paradigm that remains the status quo for many enterprises. In this old paradigm:
- Data across the enterprise is all stored in a monolithic data lake, in which all the different modes of data usage are mixed up. A ton of client sources and destinations are connected to this monolith.
- The architectural quanta (units) are steps in the data lifecycle, for example: “ingest,” “process,” “serve.”
- The teams for the data sources, destinations, and the data platform itself are siloed. People of various job functions — data engineers, analysts, business stakeholders — are likely to be frustrated.
The data mesh paradigm significantly alters this approach by changing how we break down the units that make up an organization’s data ecosystem. With a data mesh:
- Data infrastructure is centralized and the organization adheres to global standards. But that is the only thing that’s centralized.
- Architecture is domain-driven. Cross-functional teams that are concerned with a specific domain (Dehghani gives the example of an audio streaming service with domains including “audio play quality” and “user audio play interactions”) are responsible for the entire data lifecycle for that domain. They host and serve it themselves, instead of just pushing it off to a giant data lake managed by some other team.
- The important quanta are now these business domains. Each domain makes its data available across the enterprise as an interoperable, highly visible data product.
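To make the contrast concrete, here is a minimal Python sketch of the idea, not anything taken from Dehghani’s article. The names `DataProduct`, `Catalog`, the approved format list, and the team and product names are all hypothetical. The only centralized piece is a catalog that enforces a global standard and makes products discoverable; each product belongs to a business domain rather than to a pipeline stage.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataProduct:
    """A domain-owned dataset published for the rest of the organization."""
    name: str
    domain: str         # owning business domain, not a lifecycle step like "ingest"
    owner_team: str
    output_format: str  # must be one of the globally agreed formats
    description: str = ""


@dataclass
class Catalog:
    """Stand-in for the central self-serve layer (hypothetical).

    It stores no data itself; it only enforces global standards
    and makes published products discoverable.
    """
    allowed_formats: frozenset = frozenset({"parquet", "json", "avro"})
    _products: dict = field(default_factory=dict)

    def publish(self, product: DataProduct) -> None:
        # Global standard: reject products in formats nobody else can read.
        if product.output_format not in self.allowed_formats:
            raise ValueError(f"{product.output_format!r} is not an approved format")
        self._products[(product.domain, product.name)] = product

    def by_domain(self, domain: str) -> list:
        # Discovery: list the product names a given domain has published.
        return sorted(p.name for p in self._products.values() if p.domain == domain)


catalog = Catalog()
catalog.publish(DataProduct("play_events", "user audio play interactions",
                            "interactions-team", "parquet"))
catalog.publish(DataProduct("stream_quality", "audio play quality",
                            "quality-team", "avro"))
print(catalog.by_domain("audio play quality"))
```

The domain names come from Dehghani’s audio streaming example; everything else is illustrative. A real catalog would track far more (schemas, access policies, lineage), but the division of labor is the same: domains own products, and the center owns only the rules.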
Note that the data mesh approach is first and foremost a change in perspective about the best way to conceptually break down enterprise data platforms into useful units. Concrete changes in the organization should follow from this change in perspective, and Dehghani gives us plenty of criteria for evaluation. But her article is far from a “how-to” guide.
How do you actually “do” data mesh?
If there were a magic bullet, an easy step-by-step for how to build a data mesh, someone would have written about it by now. It could be argued that asking “How do I do data mesh?” is similar to asking “How do I create a stack that meets rapidly evolving data demands in the 2020s?” There’s no right answer, but there’s plenty of guidance to be found.
Think of it as an organizational problem
The idea of data mesh follows naturally from data’s shift in status within the modern organization. Data has become increasingly vital for success, while simultaneously touching the day-to-day lives of more and more types of professionals than ever before.
In the past, data lived in an ivory tower, tended to by the specific few who had the very niche skillset required to do so. It was less pervasive and less mandatory across domains, so a monolith made sense. But now data has become ubiquitous in the enterprise at all levels. It is necessary for survival, and we have the tooling to make it accessible to all.
Given all of this, it makes sense to manage data in a way that mirrors the way we manage businesses, teams, and people. Doing so means that the people working on a given data product really understand its business usage in a way that an isolated, technical engineering team never could. This makes the product better. And because each team will inevitably care about its own data products, ownership feels, if not natural, then certainly important.
A few things to keep in mind:
- There’s no one-size-fits-all answer: The way you manage your organization, people, and other resources is unique. Your data strategy is no different.
- Embrace flexibility and growth: The only thing that we can be certain of is that more change is inevitable. Part of the problem with the data monolith is its rigidity. Data mesh, because it’s so modular, tends toward flexibility. Lean into that at every opportunity.
- Prioritize access: Data is a tool and resource shared by all. At the end of the day, the most important thing is that everyone can access and use the data they need. It can be surprisingly easy to get caught up in technical details and overlook this.
Seeing data through this lens is a distinct cultural shift, and is the reason why we have the term “democratizing data.” Cultural shifts are always met with resistance of one or more varieties (vocal, unconscious, simple inertia, etc.), so you should expect it. This is not a simple matter of ripping out the plumbing and replacing it (although good plumbing is essential; more on that later).
Learn from others, focusing on themes and manageable changes
As I mentioned earlier, there is no shortage of articles, pre-recorded talks, and even Twitter threads about how organizations have implemented data mesh. But because there’s no such thing as a one-size-fits-all data mesh, expecting to follow another organization’s blueprint is likely to get you nowhere. This is especially true because often, the organizations that publicize their shift to data mesh are those with the most resources to make large organizational changes.
This is not to say that accounts of other organizations aren’t useful. On the contrary! But instead of searching for a blueprint, treat your research as a meta-analysis. Pick out the ideas that resonate most or are a fit for your scenario. Home in on common themes. Focus on ideas that are actionable on a broad scale, or small actions that can have big impacts.
To get you started, here are some resources I found informative in my own research, and highlights that are broadly, realistically applicable. Note that this also includes a lot of editorializing; I suggest you read each original resource yourself!
Data domains and data products, Piethein Strengholt
- Clear definitions are key for success when it comes to domain boundaries, interoperability standards, and usage patterns. It’s really important to front-load your process with detailed planning, lots of documentation, and clear goals. This is a time investment from the leadership team, but pays dividends.
- Set the goal of creating useable, cross-functional data products. This is the heart of data mesh. Start with this and work backward. You may find that simplicity and governance win out over a dramatic revamp of all your teams.
- Hold ALL data products to the same high standard. This means batch, streaming, and API-driven data. Every product should always be stable, and always compatible. This one can be hard, but it doesn’t have to be, and a lot of that depends on tooling (this may or may not be a shameless plug for Estuary Flow).
How we’re building our data platform as a product, Osian Llwyd Jones
- Use business tools like OKRs and KPIs, as well as UX design tools like user personas. I proposed earlier that data mesh is not just a data problem, it’s an organizational problem. And if data is a product, it has users.
- Be intentional about mapping tools to functionalities. Moving toward domain-driven architecture doesn’t mean we ignore the functional components of the data stack (storage, transformation, discovery, etc.). If anything, each domain will have access to more of these. Being clear and intentional about which tool we use for what helps with cohesion.
HelloFresh journey to the data mesh, Clemence W. Chee and Christoph Sawade
- Really focus on standardization and clarity of ownership. Are you getting deja vu yet? Good.
- Bootstrapping teams with embedded data engineers before self-serve infrastructure CAN work, but it’s hard. HelloFresh seems to have found a strategy that worked for them, but they’re still short on data engineers. Unless you have a surplus of DEs on staff (and who does?) it might not make sense to have a platform that requires one on each domain team. Instead, focus on having great self-serve infrastructure and tooling that reduces the burden on DEs.
- Very explicit ownership can help. At JPMC, each data product was assigned both a business-side owner and a technical owner. This helps keep a grip on the boundaries and goals for each product.
- Don’t expect to get it perfect right away. Expect you’ll need to iterate, but know that every step away from the data monolith is positive progress.
- Choose good tools. The organizational work of building a data mesh is critical, but many technical challenges will then arise (How to move data to and between various data stores? How to transform data? How is it accessed? etc.). Putting it together requires you to carefully choose the right tooling.
Use data mesh concepts as a lens to evaluate tooling
With all this in mind, it’s time to shift the conversation to tooling. A data mesh is a complex, multi-layered architecture; as such, it requires more than a few tools. Once you’re clear on your organization’s goals and your general architecture needs, it will be clearer which tools are the best fit.
Because the tools on the market tend to mirror the current industry trends, it’s only natural that most mature data tools out there are designed for the data monolith. Many of them can still be applied to data mesh, but you’ll really want to keep your eyes on the forefront of tool development, both from established players and newcomers.
In a data mesh, it’ll be more important than ever for your tools to play well with one another in a way that affords you some flexibility. This requires openness and interoperability. As Mammad Zadeh eloquently puts it in his interview with Barr Moses:
“I’m also hoping that we would see more open standards at the infrastructure level to encourage vendors to move away from their walled gardens and build tools that could interoperate with other components from other vendors. I think having this kind of interoperable marketplace for solutions is critical for our industry.”
Another big priority you should keep in mind when you evaluate data mesh tooling is ease of use. Self-serve infrastructure should be self-serve for a range of employees, not just highly specialized data engineers. In many ways, that would defeat the purpose of the data mesh. It would also put more strain on data engineers’ schedules, which are often overburdened as it is.
To recap, an effective tool should:
- Be user-friendly and as intuitive as possible for both technical and less-technical users
- Be flexible
- Use open standards or value interoperability in some other way
- Reduce the burden on the engineering staff (or at least not increase it)
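If it helps to make the recap above tangible, the four criteria can be turned into a simple scorecard. The Python sketch below is one hypothetical way to do that: the 0 to 5 scale, the minimum passing score of 3, the criterion keys, and the tool name are all assumptions for illustration, not an established evaluation method.

```python
from dataclasses import dataclass

# The four criteria from the recap, as short keys (naming is an assumption).
CRITERIA = (
    "user_friendly",
    "flexible",
    "interoperable",
    "reduces_engineering_burden",
)


@dataclass
class ToolReview:
    """One team's assessment of a candidate tool."""
    name: str
    scores: dict  # criterion key -> score from 0 (poor) to 5 (excellent)


def unmet_criteria(review: ToolReview, minimum: int = 3) -> list:
    """Return the criteria a tool scores below `minimum` on.

    A criterion nobody scored counts as 0, so gaps in the review
    show up as problems instead of being silently ignored.
    """
    return [c for c in CRITERIA if review.scores.get(c, 0) < minimum]


# Hypothetical review of a made-up tool.
review = ToolReview(
    name="example-pipeline-tool",
    scores={"user_friendly": 4, "flexible": 2, "interoperable": 5},
)
print(unmet_criteria(review))
```

Listing the criteria a tool fails, rather than computing one averaged score, keeps a strong result in one area from hiding a weakness in another. Which scores count as passing is a judgment call for your own organization.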
An all-purpose data mesh plan?
As we can see, there is no such thing as an all-purpose data mesh plan, in the technical sense. But from a high-level organizational standpoint, the steps to create one might be:
- Take stock of your current data stack, tooling, and most importantly, your people and team structure.
- With the goal in mind of central infrastructure and distributed products, evaluate your current organizational state. What would the data products be? Who should own each one? Who owns the self-serve infrastructure? Most importantly, how can you delegate new ownership without completely rearranging your teams?
- Identify which of your current tools are unusable in this new paradigm. Evaluate replacements, but don’t expect all changes to be one-to-one. Crowd-source from published use cases to get ideas.
- Decide on data product and metadata standards and document them meticulously before you actually do anything.
- Hold meetings, provide educational resources, and make sure you have buy-in from all stakeholders before you begin your migration.
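A documented metadata standard is easier to keep honest if it can also be checked automatically. As a rough Python sketch, assuming a hypothetical standard in which every data product must declare a name, a domain, a business-side owner, a technical owner, a schema version, and a description, a check could look like this. The field names are illustrative; your own standard will differ.

```python
# Hypothetical metadata standard: required field name -> expected type.
REQUIRED_FIELDS = {
    "name": str,
    "domain": str,
    "business_owner": str,
    "technical_owner": str,
    "schema_version": int,
    "description": str,
}


def validate_metadata(metadata: dict) -> list:
    """Return a list of human-readable problems; an empty list means it passes."""
    problems = []
    for field_name, expected_type in REQUIRED_FIELDS.items():
        if field_name not in metadata:
            problems.append(f"missing field: {field_name}")
        elif not isinstance(metadata[field_name], expected_type):
            problems.append(f"{field_name} should be {expected_type.__name__}")
        elif expected_type is str and not metadata[field_name].strip():
            # A blank owner or description is as unhelpful as a missing one.
            problems.append(f"{field_name} is empty")
    return problems


# Made-up example record with two deliberate problems.
draft = {
    "name": "orders_daily",
    "domain": "fulfillment",
    "business_owner": "",
    "schema_version": 1,
    "description": "Daily order totals per region.",
}
for problem in validate_metadata(draft):
    print(problem)
```

Returning every problem at once, instead of stopping at the first, gives a domain team the full list of fixes in one pass. That matters when the goal is self-serve publishing without a central team reviewing each submission.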
All the while, keep your scope realistic. Shifts toward data mesh can be smaller, or take place in stages over a long period of time, and still be very much worthwhile.
Most importantly, remember that data mesh is about so much more than data. It’s about a healthy organizational structure; it’s about people, teams, trust, and efficient work in an era where knowledge is our most valuable shared commodity. That side of things is nebulous and hard to pin down, but it’s just as vital for success.
At Estuary, we focus on flexible, accessible (real-time) DataOps, so you can free up your time to focus less on technical roadblocks and more on running your organization. We’re on Slack, GitHub, LinkedIn, and Twitter, and we recently launched our private beta.