In technology (as in life) being really good at one thing often comes at the expense of being really bad at something else.
Google Cloud Firestore is no exception.
Many engineering teams have grown to love Firestore’s flexible data model and robust querying capability. They use the database to build wonderfully responsive and efficient apps.
Then, an analyst comes along and wants to aggregate and analyze data in Firestore. It quickly becomes clear that the whole team is between a rock and a hard place.
Large-scale analysis of Firestore data is hard. So is getting data out of Firestore. So, what to do?
The gold standard is to find a solution that:
- Gets that data out of Firestore (for most use cases).
- Doesn’t break the bank by repeatedly reading the entire database.
- Is not a headache and a half.
The reality of this is complex, but we can get pretty close. Let’s break down some options.
Google Cloud Firestore features and limitations
We’ll begin by reviewing where the problem originates: Firestore’s unique features and limitations. These are inextricably linked.
Firestore is a hosted NoSQL database that’s part of the Google Firebase suite. In other words, it’s not constrained to the row-and-column structure of a typical relational database.
It’s also a document-oriented database. This sets it apart from its counterpart in the Firebase suite, Google Real-Time Database.
- Firestore has a unique data model that allows complex document nesting and robust querying at high speeds. This makes it a great choice for powering cutting-edge applications. The Firestore database backing a given application is often large, intricate, and uniquely structured.
- Google real-time database has no documents — each database instance is basically one giant JSON tree. Because of this, it’s quite easy to export that JSON from the console. For a given project, you’ll typically have many small databases rather than one large one (for example, each customer has their own database).
As a document-oriented NoSQL database, Firestore might be more aptly compared to MongoDB. These two model data differently, however. We won’t get into detail, but the upshot is that MongoDB is considered better-suited to act as the central transactional database for large products. Firestore, on the other hand, is better for quick and lightweight applications, like chat applications.
Now, a brief refresher on the Firestore data model…
Intro to the Cloud Firestore data model
Firestore’s claim to fame is the flexible hierarchy of its data structure. It supports complex nesting and robust querying.
The main unit of storage is a document. It’s a schemaless, lightweight set of key-value pairs — basically-but-not-quite JSON, according to Google.
For example, a single document might look like:
user_1 name : “alice” age : 37 country : “USA”
Documents are contained in collections. Collections can be nested. Documents can also be nested. Documents can even come with subcollections, collections nested underneath them in the hierarchy.
For example, a collection
users with two documents that each contain a subcollection called
messages might look like:
users user_1 name : “alice” age : 37 country : “USA” messages 1676604320 : “hello” 1676615143 : “where am I…?” user_2 name : “bob” age : 50 country : “USA” messages 1676626329 : “i love firestore!”
Once again, it’s all about flexibility. Read more in Google’s docs.
Using Google Analytics for Firestore data
Given the flexibility we just discussed, you can see why it’s hard to analyze data in Firestore.
However, there are some cases where it’s actually really easy.
- The Firestore data you want to analyze has to do with the usage of a particular app in Firebase.
- You specifically want to monitor app usage and user behavior through event tracking and audience segmentation.
…then you can simply use Google Analytics!
Google Analytics is baked into Firebase. You simply add the Analytics SDK to the app in question, and configure the events and user properties of interest. A modest amount of extensibility is also available with Google Cloud Functions.
This is a roundabout solution (you’re not directly analyzing the data in Firestore), but in certain cases that doesn’t matter.
If this sounds like it’ll work for you, instructions from Google are here.
If not, we must move on from the tooling Google provides. Unfortunately, Google doesn’t make it terribly easy to escape its ecosystem.
How to aggregate Firestore data
Firestore is quite adept at individual queries, even complex ones, but it’s nearly impossible to analyze large datasets there. That’s because Firestore’s support for aggregation is pretty minimal.
Aggregation is imperative for doing analytics right. You need to be running your analytical workloads against a storage system designed with aggregation in mind — ideally, one that allows you to easily aggregate and explore data across many dimensions. While Firestore’s aggregation functions might be useful for one-off queries, you can’t build a serious analysis off it.
The best way to prepare Firestore data for aggregation and analysis is to get your data out of Firestore.
There’s no way around this if you care about creating robust analytical models, machine learning pipelines, data science projects, and the like.
Even if your goal is just to use a business intelligence tool that’s not Google Analytics, the first thing you need to do is migrate your Firestore data to a different storage platform that can integrate with that tool.
Here are some options:
1— Export Firestore data to CSV or JSON
There are various free and paid tools available that make it a cinch to export Firestore databases as CSV or JSON.
You can also perform a migration by writing your own Google Cloud function and using a package like json2csv.
From there, you have a common, easy-to-parse data type representing your Firestore database to analyze as you like.
Advantages of exporting Firestore to CSV or JSON:
- Well-suited for one-off exports.
- Flexibility to do whatever you want with the data, and choose between free and paid tools.
- CSV is a popular format for data science; JSON is relatively easy for developers to parse and load into other storage systems.
- Some of the tools above also come with a GUI for basic data exploration and manipulation that might be able to meet your needs.
Disadvantages of exporting Firestore to CSV or JSON:
- In many cases CSV or JSON is not your desired final format.
- Parsing output from more complex Firestore databases could get hairy.
- If you want to keep the data in analytical storage (read: a data warehouse), which is a best practice, you have to get it there on your own.
- If you want to keep an up-to-date data record, this gets both time-consuming and expensive, for the same reason as using the managed export service, which I’ll discuss next.
2— Use the managed Firestore Export Service
The only path Google actually provides to move data out of Firestore is with the managed Firestore Export Service. You can choose to export specific collections or all documents in the database.
The export comes in leveldb log format, an open-source key/value log format created by a Google employee and commonly used by Google.
Now, let’s talk cost (and note this applies to the CSV and JSON export methods above, or ANY method that must read the entire dataset for each export). As Google’s documentation states, Firestore charges by data read: $0.06 per 100,000 documents, to be exact.
This might not seem like much, but it adds up quick, especially if you have to export your whole database regularly to keep an up-to-date copy of data for analysis!
Advantages of the managed Firestore Export Service
- Full support from Google.
- Easy to narrow down certain collections to export.
- If you’re moving data to BigQuery, there’s an easy path to do so.
Disadvantages of the managed Firestore Export Service
- Repeated exports to keep data timely will get expensive quickly.
- Leveldb logs aren’t exactly a popular data format, so it’s hard to move the data to most other systems.
3— Use Change Data Capture
Our final option is to use a change data capture (CDC) solution.
We usually think of CDC from traditional relational databases (Postgres, MySQL, etc). These systems have some sort of change logging capability you can tap into and isolate change events (ie, the Postgres WAL and MySQL binlog).
But at its core, all CDC really means is isolating change events in a source system so a downstream system can react, usually by updating a target.
In other words, we’re talking about a system that listens for data changes in Firestore and reads only changed documents. The advantages of this method are simple but key.
Advantages of Firestore CDC:
- Save lots of money on Firestore read operations, because after the initial backfill, only changed data is read.
- Data changes are captured from Firestore to the downstream analytical storage system or platform extremely quickly, so your analytical models and BI tools always reflect the latest data.
Disadvantages of Firestore CDC:
- Hard to build.
As a person who has read blogs on the internet before, what I say next probably won’t surprise you (I’m going to, ahem, recommend my company’s product) but just stick with me.
Vendors that help you to set up CDC pipelines have proliferated in recent years, and CDC has become an industry standard. You’ll have quite a few options if you go looking for CDC solutions for a big-name RDBMS.
Platforms that support CDC pipelines from Firestore, on the other hand, are less common. This is probably because Firestore is…
- Less mature
- More niche
…than those big-name RDBMS. But that doesn’t mean your Firestore data should have to sit in a silo, or be restricted to outdated and costly bulk export operations.
That’s exactly why Estuary decided to create a Firestore CDC connector as part of our real-time data platform, Flow, which you can try for free here.
Moving data from Google Cloud Firestore to a data warehouse for analysis
After connecting to your Firestore database, you can use a CDC tool like Flow to push that data to all sorts of destinations. But there are two that are arguably the most important.
They’re two of the most popular and powerful data warehouses today:
- Google BigQuery
Why is it important to connect Firestore to a data warehouse in real time?
Simply put, the data warehouse acts as the single source of truth for analytical workflows in the data stack.
You can hook up all of your team’s analytical tools to the warehouse, so everyone (data scientists, dbt users analytics engineers, BI dashboards, and SaaS apps) is working off the same data. And that data should be updated within down-to-the-second (or millisecond) accuracy if you don’t want outdated analytics.
Unlike transactional systems (like Firstore), data warehouses are designed to handle the rigors of numerous analytical workflows.
Plus, this is a way more efficient architecture than setting up a pipeline from Firestore straight to every end-user application related to analysis.
Ultimately, a real-time Firestore to data warehouse pipeline creates:
- Less engineering work
- Less cross-department frustration
- Timely and accurate business outcomes
Firestore to BigQuery is a more common use case, and is supported by various pipeline vendors and Google itself (though not usually using CDC).
But I’m not aware of any other company currently offering real-time Firestore to Snowflake integrations. Full blog on that subject here.
If you’d like to see an integration with another data warehouse or storage platform, leave a comment below! We prioritize our development based on demand.
How the Estuary Flow Firestore capture connector works
For my discerning engineer readers who are new here (👋), you’re probably wondering:
Ok, what’s the deal with this “real-time data platform” anyway? And how does Firestore CDC work?
The quick rundown:
- Flow is a managed platform with a GUI and a CLI. It allows you to move data from wherever it is to wherever you want.
- Flow has an event-based runtime called Gazette (not Kafka), allowing real-time pipelines.
- Our real-time connectors are open-source and built by our team. We also support Airbyte connectors.
- The Firestore connector works by performing a historical backfill and then asking Firestore to alert it of change events.
I invite you to take a look at:
You can sign up for Flow’s free tier here or reach the engineering team on Slack with any questions.
Google Cloud Firestore is a NoSQL database whose unique data model makes it great for transactional workloads backing lightweight apps, but not so great for analysis. It also doesn’t natively provide many options for export.
If your analytical needs are basic, Google Firebase’s built-in Google Analytics integration might be all you need.
On the other hand, if you need to use the data in your Firebase instance to power analytical workflows across an organization, you should move the data out of Firestore and into a data warehouse.
To do that, you’ll need to either:
- Repeatedly export data
- Use a CDC pipeline
No matter your use case, there’s likely a solution out there well suited for you. Hopefully this gives you enough food for thought to get that process going.
Have other considerations on Firestore analysis I didn’t mention here? Feel free to comment below or hit up the team on Slack!