By David Yaffe, CEO, and Johnny Graettinger, CTO
Hi, we’re Estuary.
We’re building a data operations platform — a tool that helps you synchronize your datasets across the systems where they live, and the systems where you want them to live, with millisecond latency.
Our platform unifies today’s batch and streaming data paradigms, supporting both historical and current data in the same low-latency data pipeline. We’re also growing an ecosystem of open-source data connectors that work in real-time and are compatible across platforms.
This type of work — building applications that collect and move records between systems, maybe with transformations applied along the way — is hard to do by hand. If you’re a developer, you’ll be familiar with the minefield of challenges that comes with it.
From an organizational standpoint, you understand the business importance of well-cataloged and discoverable data products, and the fallout that can ensue if these products break in the face of change.
Our mission and guiding principles are defined not just by the common challenges of our field, but also by our personal experiences. That’s why we want to discuss how we got here, the beliefs we hold as a result, and the vision we work towards.
We’ve been working together to build low-latency, high-scale data products for 12 years.
Our first startup, Invite Media, was a DSP that aggregated advertisement opportunities at a global scale. In order to catch every opportunity in time, we had to build infrastructure that handled about 20 million requests per second and responded to them in milliseconds. Invite Media was later acquired by Google.
But it wasn’t until we founded our next startup, Arbor, in 2014 that the seeds of what would later become Estuary were planted. Described as “the first marketplace for people-based data,” Arbor allowed groups with access to people-based data (such as publishers and app developers) to connect with those seeking data (such as marketers), to mutual benefit. The company grew quickly and was acquired by LiveRamp.
Both the marketing and adtech industries are well-known for accumulating huge volumes of data and needing to leverage it rapidly. By this point, the pain points that came along with these types of workflows had become obvious to us. Namely:
- Companies often had data in tons of different systems, which were extremely hard to reconcile.
- Disparate systems and pipelines produced inconsistent views of what should have been the same data.
- Lots of staff, time, and resources were devoted to making these systems connect and agree, with varying degrees of success.
To avoid these issues at Arbor and later LiveRamp, we built Gazette, a modern, flexible, and scalable streaming broker. Gazette stores data as regular files in cloud storage, which makes it unique among similar platforms. Perhaps more importantly, it provides a more time-efficient, low-stress solution for a small team.
An early hire at Arbor perhaps put it best: “Gazette makes you fearless.” With Gazette, a new application can backfill over petabytes of data without putting production at risk, as those reads serve from files in cloud storage rather than proxying through (and possibly knocking over!) production brokers. You never need to worry about running out of disk space. Gazette datasets are easy to integrate with systems that understand files in cloud storage, the lingua franca of modern data architectures.
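To make the idea concrete, here is a minimal Python sketch of a backfill that reads stored files directly instead of going through a broker. It is a toy illustration and not Gazette's API: the local directory standing in for cloud storage, the offset-named `.jsonl` file layout, and the `write_fragment`, `backfill`, and `demo` functions are all assumptions made for this example.

```python
import json
import tempfile
from pathlib import Path


def write_fragment(store: Path, begin_offset: int, records: list) -> int:
    """Persist a batch of records as one immutable file (hypothetical layout).

    The file name encodes the offset of the first record, so a reader can
    order the files without asking a broker. Returns the next free offset.
    """
    path = store / f"{begin_offset:012d}.jsonl"
    path.write_text("".join(json.dumps(r) + "\n" for r in records))
    return begin_offset + len(records)


def backfill(store: Path, from_offset: int = 0):
    """Yield (offset, record) pairs by reading the stored files directly.

    No broker is involved, so a large historical read puts no load on the
    process that serves live traffic.
    """
    for path in sorted(store.glob("*.jsonl")):
        begin = int(path.stem)
        with path.open() as f:
            for i, line in enumerate(f):
                offset = begin + i
                if offset >= from_offset:
                    yield offset, json.loads(line)


def demo() -> list:
    """Write two files, then backfill everything from offset 2 onward."""
    with tempfile.TemporaryDirectory() as tmp:
        store = Path(tmp)  # stand-in for a cloud storage bucket
        next_offset = write_fragment(store, 0, [{"id": 0}, {"id": 1}, {"id": 2}])
        write_fragment(store, next_offset, [{"id": 3}, {"id": 4}])
        return list(backfill(store, from_offset=2))
```

Calling `demo()` replays the records at offsets 2 through 4 purely from files. In a real deployment the same pattern would apply to objects in a bucket, with only the most recent, not-yet-persisted data coming from a live broker.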
With Gazette in place, a single engineer, working largely hands-off, managed a reliable data streaming service that averaged 7 GB of data per second. Compared to other streaming brokers, this was a much more operationally efficient real-time data backbone, and a truly viable solution for a small startup.
After the Arbor exit, Gazette was open-sourced, but much of its potential was untapped. As a highly technical tool, it still had a pretty high barrier to entry. Simply offering Gazette as an alternative to other mature streaming brokers (in other words, aligning with the current streaming paradigm) didn’t seem to be what the industry needed. What the industry needed was a paradigm shift.
Around this time, the concept of an integrated modern data stack was gaining momentum. Its conceptual framework offered a glimpse into an appealing future for data engineering: one where a set of tools could handle the doldrums of integrations, allowing engineers and analysts to focus their time on delivering value.
But the conceptual framework we know as the “modern data stack” doesn’t address low-latency use-cases. Instead, companies still maintain separate (and less user-friendly) infrastructure for these use-cases. What if Gazette were baked into a highly usable, adaptable platform and managed service? Like other tools in the modern data stack, this platform would enable companies to shift their focus to solving data problems rather than building infrastructure. But it could do so regardless of the latency and temporal requirements of those data problems, making it the first of its kind. By meeting the needs of business analysts and engineers alike, it could truly operationalize real-time data and challenge preconceived notions about its difficulty.
And so, we founded Estuary Technologies, and our product, Flow, was born. Flow is the runtime and managed service built on top of Gazette, and is currently in private beta. Our ecosystem of open-source connectors — which can be used within Flow or separately — is available and features frequent new additions. Our team is rapidly growing, with two regional offices and a few additional remote workers.
For more on our backstory and funding, check out our feature in VentureBeat.
Estuary’s guiding principles
We’ve been guided this far by overarching themes that became apparent to us years ago, and that ring especially true in today’s data infrastructure landscape. Here are a few of them:
The future of data is real-time, and we’re due for a paradigm shift
We’re stuck in an old status quo that many data professionals and business leaders accept. It goes something like this: Real-time data would be great, but it’s not practical. Batch data is easy.
It’s absolutely true that smart, powerful batch solutions have proliferated and are often easy to implement and manage. But this doesn’t mean that we should write off real-time data infrastructure as a hard problem that will be solved in some vague future time. We have the capabilities now; it’s simply a matter of distributing the right tooling.
Like any successful paradigm shift, the driving force behind this one will be the advent of systems that make it financially and operationally more viable than the incumbents. In the case of real-time, those systems are already up-and-coming. We’ve seen a rise in data use-cases that demand instantaneous feedback, personalization, advanced automation, and cyber-security in data systems. All of these require robust, highly scalable real-time data infrastructure. Once this infrastructure is accessible to all organizations, the friction we see today will be removed, real-time will shift to become the more viable data option, and more use-cases will proliferate.
After all, nobody wants slower data. We’ve simply settled for it.
Great tools combine usability for business analysts and flexibility for engineers
A pervasive problem faced by data-driven organizations is the rift between those who use data (analysts and business leaders) and those who design, manage, and maintain it (engineers). This has to do with job function, sure, but it’s also related to tooling.
Many tools currently on the market have great, user-friendly interfaces that empower business professionals to leverage data, but frustrate engineers by denying them the space to re-design under the hood. Other tools provide the flexibility engineers need to do their job well, but are confusing and inaccessible to less technical users.
Especially if we want to prevent the data landscape from fragmenting even further, we should provide tools that meet the needs of both user groups.
We’ve seen dbt do this for data transformation. Our goal for Flow is to do the same for data integration and operations. This means providing a quality UI as well as a practical back-end entry point.
Consolidation is the new specialization
We’re in the midst of a huge proliferation of increasingly specialized data tools. A lot of them do really cool things for specific use-cases. However, this means that unifying your data stack or pipeline is becoming harder than ever, which leads to the creation of more niche tools, and the cycle continues.
That’s why unification is a central tenet of our work. We’re not here to replace every element of your data stack and claim we can do everything you’ll ever need from your data. What we are here to do is make sure that everything you need is connected to everything else.
We know that change is inevitable — both for individual customers and the broader industry — so we seek to provide a real-time data backbone that can handle whatever the future throws its way without compromising millisecond latency.
Plus, an efficient solution with fewer moving parts saves you money and effort.
Community & further reading
This is just the beginning. Want to get involved in the conversation and stay up to date on product announcements and new connectors? Use these resources.