Comparing Data Models in CosmosDB
I'm going to take you through my thought process, compare two different data models built from the same data, and walk through a few simple experiments I ran to compare the operational costs associated with the two models.
Last year I worked on a data streaming and ingestion project, bringing various data assets into the cloud in near real time. Our target data store was Azure CosmosDB. This was my first real experience with NoSQL, and CosmosDB in particular, and it was quite a mind shift from the more traditional relational datastores I've worked with in the past. The data modeling especially is different, since you're not confined to strict tables of rows and columns. Instead the data model is loose: each record is a document, typically JSON, and there is no strict format. Your document could represent one record, a collection of records, or multiple collections of records. Although you can organize the data however you want, you really should think about how the data will be used by the applications consuming it.
The Context
Let's say I wanted to store some stock data, specifically annual balance sheets, for plotting and studying a company's performance. I'd periodically update it to pick up the latest balance sheet. And since historical data doesn't change, I don't want to keep calling an API to get it; instead I'll call the API once and store the data for use later.
The API returns a large JSON response containing several years' worth of balance sheets, like the one below:
```json
{
  "symbol": "IBM",
  "annualReports": [
    {
      "fiscalDateEnding": "2023-12-31",
      ...
      "commonStockSharesOutstanding": "915013646"
    },
    {
      "fiscalDateEnding": "2022-12-31",
      ...
      "commonStockSharesOutstanding": "906091977"
    },
    {
      "fiscalDateEnding": "2021-12-31",
      ...
      "commonStockSharesOutstanding": "898068600"
    }
  ]
}
```
You can see the symbol, but within the JSON object there is a list, annualReports, which contains several records.
My super simple use case is: I want to plot a few given fields over the years of data I have. In a larger project you might have several use cases, but it's easiest to focus on a few of the largest ones.
The response looks easy enough to use, so I could store the response as-is in a CosmosDB container as a single document. OR the other option is to extract the individual annual reports contained in the list and store them as individual documents. The second option should seem familiar because that's how we'd store it in a relational database: each year would be stored as a distinct row in a table. So my two cases are:
1 - Giant Document (As-is)
OR
2 - Split the data into smaller documents, one annual report per document.
Reference data: Alpha Vantage Demo - IBM Balance Sheet
Experiments
Any experiment has to start with a question to answer. The question I'm trying to answer is: "What is the most efficient way to store this data, given that the number of records is the same?" CosmosDB has a nice way to compare operations by distilling them down to some number of request units (RUs). Microsoft then charges based on the RUs you use per month. So in my experiments my measure of efficiency is RUs: I'll execute the operations against each data model (giant or small documents) and compare the RU cost. The lower RU cost is better.
The operations I'm considering are read, write, update, and delete, executed on the original structure seen above and on a revised structure where each subitem in the annualReports list is an individual document.
An example of a single balance sheet document:
```json
{
  "fiscalDateEnding": "2023-12-31",
  ...
  "partitionKey": "IBM_annual_balance_sheet",
  "id": "2023-12-31"
}
```
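To make the split concrete, here's a minimal sketch of the transformation, assuming response holds the parsed API payload; the helper name is my own, not part of any library:

```python
def split_into_documents(response: dict) -> list[dict]:
    """Turn one giant API response into one document per annual report."""
    symbol = response["symbol"]
    documents = []
    for report in response["annualReports"]:
        doc = dict(report)  # copy the report's fields as-is
        # Synthetic keys so CosmosDB can address each document directly
        doc["partitionKey"] = f"{symbol}_annual_balance_sheet"
        doc["id"] = report["fiscalDateEnding"]
        documents.append(doc)
    return documents
```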
For a valid comparison, I have to compare the total cost of multiple operations in the individual documents model against the cost of one operation in the giant document model.
Also I am running my code against the CosmosDB emulator, but you could easily run it against the actual service.
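For reference, here's roughly what my setup looks like with the azure-cosmos Python SDK; the database and container names are placeholders, and the key below is the emulator's well-known, publicly documented account key:

```python
from azure.cosmos import CosmosClient, PartitionKey

# The CosmosDB emulator listens on localhost with a well-known, documented key
EMULATOR_URI = "https://localhost:8081"
EMULATOR_KEY = (
    "C2y6yDjf5/R+ob0N8A7Cgv30VRDJIWEHLM+4QDU5DE2nQ9nDuVTqobD4b8mGGyPMbIZnqyMsEcaGQy66x/Xw=="
)

client = CosmosClient(EMULATOR_URI, credential=EMULATOR_KEY)
database = client.create_database_if_not_exists("stocks")
container = database.create_container_if_not_exists(
    id="balance_sheets",
    partition_key=PartitionKey(path="/partitionKey"),
)
```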
Trial 1 - Writing the data
As you might expect, writing the giant document is expensive. However, it still costs about 36% less than writing all of the individual documents.
Single Giant Document Write: 132.19 RUs
Sum of Multiple Smaller Writes: 207.199 RUs
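If you want to reproduce the measurement, the RU charge of the last operation comes back in the x-ms-request-charge response header. A sketch, assuming the container from the setup snippet (one per model in practice, but called container here for brevity), the split_into_documents helper from earlier, and a giant_response dict holding the parsed API payload:

```python
def request_charge(container) -> float:
    """RU cost of the most recent operation executed through this container."""
    return float(
        container.client_connection.last_response_headers["x-ms-request-charge"]
    )

# Model 1: the giant document, stored as-is plus the keys CosmosDB needs
giant_response["id"] = giant_response["symbol"]
giant_response["partitionKey"] = f"{giant_response['symbol']}_annual_balance_sheet"
container.upsert_item(giant_response)
print("Giant write:", request_charge(container), "RUs")

# Model 2: one write per annual report, summing the individual charges
total = 0.0
for doc in split_into_documents(giant_response):
    container.upsert_item(doc)
    total += request_charge(container)
print("Sum of smaller writes:", total, "RUs")
```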
Trial 2 - Reading the data
Reading the data is interesting because there are various ways to read data in CosmosDB. One way is to query the data using the SQL API; alternatively, if you know the id and partitionKey, you can execute a point read, which fetches a single document in the most efficient way possible.
Remember, my goal is to get all the data I need to create a plot as efficiently as possible. Since the "Giant" document is easily identified, I can get all the data I need in a single point read.
I'm not so lucky when requesting the data from the individual documents. In this case I'll need to request all the documents using a query.
Point Read (Giant) = 2.19 RUs
Query all Individual docs = 3.69 RUs
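A sketch of both read paths, assuming each model lives in its own container (giant_container and reports_container are placeholders) and that the giant document uses the symbol as its id, as in the earlier snippets:

```python
# Point read: direct lookup by id + partition key (giant document model)
giant = giant_container.read_item(
    item="IBM", partition_key="IBM_annual_balance_sheet"
)

# Query: fetch every annual report document (individual document model);
# scoping to a single partition keeps the query from fanning out
reports = list(
    reports_container.query_items(
        query="SELECT * FROM c",
        partition_key="IBM_annual_balance_sheet",
    )
)
```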
So far it makes the most sense to keep the data as-is rather than splitting it up into individual documents.
Trial 3 - Updating the Data
In both cases I chose to update a single attribute of the document to keep the comparison the same.
In the giant document case, updating a single attribute is noticeably more expensive: the same update on an individual document costs about 40% less. Why might this be? Well, in my case I'm using upsert_item from the Python SDK for both documents, which means the entire original document is replaced. There is also patch_item, which might be more performant, but I hadn't tried it at the time of writing.
Update the Giant Document = 23.24 RUs
Update an indvidual document = 13.71 RUs
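Here's what the upsert-style update looks like next to the patch_item alternative I mentioned but didn't measure; the field being changed is just an illustration:

```python
# What I measured: read, modify, and upsert, which replaces the whole document
doc = container.read_item(item="2023-12-31", partition_key="IBM_annual_balance_sheet")
doc["commonStockSharesOutstanding"] = "915013647"  # illustrative change
container.upsert_item(doc)

# The untried alternative: a partial update that only sends the changed path
container.patch_item(
    item="2023-12-31",
    partition_key="IBM_annual_balance_sheet",
    patch_operations=[
        {"op": "replace", "path": "/commonStockSharesOutstanding", "value": "915013647"}
    ],
)
```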
This use case is trivial and doesn't fully represent an actual use case. In reality, when updating in the giant document model, the giant document would be replaced with a new document containing all the sub-documents. This would only happen once a year since I'm looking at annual reports.
In the individual case, the new annual report is a new document to be stored in the container.
Replace the Giant Document: 23.24 RUs (same as above)
Add a new individual document (Insert Single Document): 12.95 RUs
From this perspective, it would be more efficient to insert just a single document into the container.
Trial 4 - Deletion
Most likely you'd want to keep some X number of years of balance sheets. You'd want to weigh what it costs to store the data against deleting it. Typically storage is pretty cheap, so the cost of storing extra data might not be a concern. BUT let's say you want to strictly maintain only the most recent 10 years. We have to consider the initial load scenario and the handling of subsequent years.
Initial (Bulk) Load
This is the first time you load all the data into CosmosDB, and it has to happen for both models.
During the initial load no deletes actually happen; as the data is processed we can determine what to save and what to ignore. This is the same for both models. The situation is slightly different after the bulk load.
Subsequent years
Here's the scenario: my initial data is loaded, and a year has passed, so I need to remove the oldest year.
With the giant document, the operations look like this:
- Request the latest collection of balance sheets from the external API.
- Modify the original response, removing the records which aren't needed.
- Upsert the modified document, replacing the original document.
This looks like a single update operation; there is no separate delete operation.
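As a sketch, the whole yearly refresh collapses into one upsert; fetch_balance_sheets is a hypothetical stand-in for the external API call:

```python
from datetime import date

def refresh_giant_document(container, symbol: str, years_to_keep: int = 10) -> None:
    response = fetch_balance_sheets(symbol)  # hypothetical external API call
    cutoff = f"{date.today().year - years_to_keep}-01-01"
    # Trim reports that fall outside the retention window before storing
    response["annualReports"] = [
        r for r in response["annualReports"] if r["fiscalDateEnding"] >= cutoff
    ]
    response["id"] = symbol
    response["partitionKey"] = f"{symbol}_annual_balance_sheet"
    # One upsert does the update and the "delete" in a single operation
    container.upsert_item(response)
```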
In the individual document model we can get away with no delete operation as well, if we take advantage of document TTL (time-to-live). The TTL tells CosmosDB to delete the document after the TTL expires. At the time of writing this post, the max value of TTL is 2147483647 seconds, which is a little over 68 years. As long as we set the oldest document's TTL to 1 year, a newly inserted document's to 10 years, and continue that pattern with each new document, we shouldn't need a delete operation.
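One caveat: TTL only takes effect if it's enabled on the container (a default TTL of -1 means "on, but documents never expire unless they say so"). A sketch, reusing the database object from the setup snippet, where new_report stands in for a freshly split annual report document:

```python
from azure.cosmos import PartitionKey

SECONDS_PER_YEAR = 365 * 24 * 60 * 60

# TTL must be switched on at the container level; -1 enables it with no default expiry
container = database.create_container_if_not_exists(
    id="balance_sheets",
    partition_key=PartitionKey(path="/partitionKey"),
    default_ttl=-1,
)

# Each new annual report carries its own expiry; CosmosDB deletes it when it lapses
new_report["ttl"] = 10 * SECONDS_PER_YEAR
container.upsert_item(new_report)
```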
Concluding Thoughts
Although this use case is trivial, it illustrates the considerations you need to weigh when deciding how to structure data in CosmosDB. Given these two options, I'd end up keeping the response as-is, since it's slightly more efficient from the database perspective. Thinking through a couple of data maintenance operations like update and delete, they seem pretty straightforward from a programming perspective as well. Creating a proof of concept, testing it out on a serverless instance, and observing the RUs over time can help you better understand your consumption cost.
Not mentioned in this post is the use of indexes. By default CosmosDB indexes every attribute in the document, but this can be modified to only index the attributes you care about in your queries. Reducing the indexes to only what you need helps to reduce RU costs. It's another avenue for fine-tuning your data model and configuration.
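For example, a container that only indexes the fields my plotting query touches might be created like this; the included path is illustrative, not a recommendation:

```python
from azure.cosmos import PartitionKey

indexing_policy = {
    "indexingMode": "consistent",
    # Index only what the plotting query filters or sorts on...
    "includedPaths": [{"path": "/fiscalDateEnding/?"}],
    # ...and exclude everything else to cut the write-time RU cost
    "excludedPaths": [{"path": "/*"}],
}

container = database.create_container_if_not_exists(
    id="balance_sheets",
    partition_key=PartitionKey(path="/partitionKey"),
    indexing_policy=indexing_policy,
)
```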