James Duggan

CosmosDb and CRUD with a S

In preparation for my talk on Azure SQL Database vs. CosmosDb - How do you choose at giving a talk at sqlbits I spent a lot of time thinking about create, read, update and delete (CRUD) operations, and concluded that with the advent of CosmosDb and its multi-model abstraction layer, CRUD needs a S.

CRUD as an acronym has served software engineering very well by encapsulating the core actions of data persistence and retrieval. In the transition from the “ye olde” closed and monolithic systems to the current open, loosely coupled and distributed micro-service architectures via client server and service orientate architecture (SOA), the focus around CRUD operations has been towards highly efficient operations that are simple, isolated and easy to test, and away from complex, inter-dependant, and more often than not, hard to test and inefficient.

CRUD as an acronym has severed us well until now.

A little bity of history

Being neither a historian nor an academic I will probably get my facts wrong here, but essentially what CosmosDb represents as an abstraction. Initially there was DocumentDb as a nosql storage, with a lightweight SQL implementaiton, and fundamentally, a scale-out database abstraction. This then evolved into the globally distributed, partitioned, multi-model abstraction layer called CosmosDb with DocumentDb renamed to SQL API and relegated to model status with peers such as Graph, Table, MongoDb (and now Cassandra), essentially storage.

Scale-up vs. scale-out

Database scale-up is easy. It follows the (simplified) cycle of monitor, understand application behaviour, optimise, add more resources such as memory, cpu, storage, and repeat.

Database scale-out, also known as database sharding or in the CosmosDb vernacular, partitioning, is not easy, simply because there are many databases. Identifying the correct shard key to distribute data, managing skew and limiting data operations to a single database rather than fanout to many databases, must be considered in advance. This requires upfront understanding of the application behaviour, all the read, and all the write patterns. How does the application know where to write data and retrieve it at a later point in time? The answer is it doesn’t, and so a location database (shard map manager) is required, which in turn implicitly requires the application knows the shard key. Apart from increased design effort to ensure good data distribution, deployment is harder, as is the increased maintenance overhead, for example, adding or removing databases.

Why does this matter?

The Azure PasS SQL Database offering is a scale-up abstraction. In the on-premise world of SQL Server, adding more resources is nearly always the blocker for the scale-up scenario described above. Not so for SQL Database where scale up is turnkey. Moreover, in abstracting scale-up, CRUD continued to work well as the rules were still the same.

In a similar way, the Azure PaaS CosmosDb offering is a scale-out abstraction with the added allure of automatic indexing. It takes away much, if not all, of the management and deployment pain. However, the rules have changed, and this is not immediately apparent.

Abstraction is a good thing

We benefit from abstraction all the time and I think the following excerpt from the azure sql database technical overview while specific to Azure SQL Database, captures the essence of a good abstraction

These capabilities allow you to focus on rapid app development and accelerating your time to market, rather than allocating precious time and resources to managing virtual machines and infrastructure

Interestingly, we deal with the effects of over dependence on an abstraction or lack in understanding of what is being abstracted, more often than we realise. Think for a moment about the separation of code from data by object-relational mapping (ORM), which overtime resulted in software engineers with close to zero understanding of databases wreaking havoc on DBA teams.

CRUD needs an S

History lesson out of the way, let’s get back to my point.

Since sharded databases are not common place, I assert that when a software engineer approaches CosmosDb and thinks about CRUD operations, they are mostly thinking about a single database, which is the most common implementation. However, since CosmosDb is distributed and partitioned (sharded), there are some rules and both data and software engineers need to approach this very differently and be mindful that not everyone understands sharding. Previously an application started out small, scaled-up and perhaps scaled-out, or was designed with scale-out in mind from the outset. In both cases, at some point in time an engineer arrived at the conclusion that scale-out was required and thought about how do it properly (at least I hope they did).

It’s not about pluralising

Though possibly a useful way to think in terms of data being distributed across many databases, adding an S is not about pluralising. The SQL API (formerly DocumentDb) .NET driver supports the following operations, create, read, upsert, replace, delete & query. The create, replace and upsert operations require the partition key, which can be inferred from the json document. The read and delete operations also require the partition key and this must be specified through the operation’s RequestOption argument, since neither operation support a json document argument. So far, partition key notwithstanding, CRUD holds up.

What then is the S all about? Through the query operator the SQL API supports lightweight SQL statements, namely select, Furthermore, the query operator implements different behaviour by allowing the partition key to be optional. If not specified then the query is a fanout one, which may have considerable performance and therefore cost implications depending on the query frequency and the volume of documents returned. For this reason select statements or scenarios intending to take advantage of automatic indexing should be given due consideration early in the design phase. At this point it is worth mentioning the query operator goes hand-in-hand with automatic indexing and a query filtering by a non indexed column will fail.

In Summary

I repeatedly say that cross partition query is an anti-pattern to scale-out. Since I doubt CRUDS will ever take off as an acronym (I can live with this), it is more important the reader understands the performance and therefore financial cost of cross partition queries, and mitigate wherever possible. In general I find that once understood, architects and engineers will rework their solution to avoid costly fanout scenarios.

I think is it important to clarify there is nothing new here per se. Read up on sharding be it for Azure SQL Database or MongoDb, and in all cases the perils of fanout are highlighted. What is different now, is that with CosmosDb the complexities of sharding are abstracted away, and as engineers we have to up our game.

Twitter

comments powered by Disqus

Contact Us

For enquiries please contact:

e: hello@sabin.io

t: 0208 1234 023