Consistency Models in Azure Cosmos DB: From Strong to Eventual

In distributed databases, replication is mandatory for availability and low latency. That replication introduces a fundamental tradeoff: how fresh your reads are versus how available and fast the system stays. The PACELC theorem captures this:

  • On partition (P), choose Availability vs Consistency (A/C).
  • Else (E, normal operation), choose Latency vs Consistency (L/C).

Azure Cosmos DB exposes these tradeoffs explicitly via five consistency levels, from Strong to Eventual, instead of forcing you into a single strong-or-eventual choice.

In microservice architectures, these guarantees often sit behind service-to-service APIs (frequently implemented with gRPC); for transport and contract tradeoffs, see Transitioning from REST to gRPC: System Design and Tradeoffs.


Consistency levels in Azure Cosmos DB

Cosmos DB supports five consistency levels (strongest to weakest):

  Level              Guarantee summary
  Strong             Linearizable reads; always see the latest committed write.
  Bounded staleness  Reads lag behind writes by at most K versions or T time.
  Session            Per-session read-your-writes and write-follows-reads.
  Consistent prefix  Writes are seen in order; no out-of-order reads.
  Eventual           Replicas converge eventually; no ordering guarantees.

The spectrum from Strong → Eventual trades some consistency for higher availability, lower latency, and higher throughput.

[Figure: Azure Cosmos DB consistency spectrum from Strong through Bounded Staleness, Session, Consistent Prefix, to Eventual. Moving right from Strong to Eventual increases availability, lowers latency, and improves throughput at the cost of read freshness.]

Strong consistency

Strong consistency gives linearizability:

  • Every read returns the most recent committed write or an error.
  • Clients never see partial or uncommitted writes.
  • All replicas appear to move forward in a single global order.

Operationally:

  • Writes must be replicated and committed across regions before they are visible.
  • This increases write latency and can reduce availability during failures (if some replicas cannot commit).

Use Strong when:

  • Business logic cannot tolerate stale reads:
    • Financial balances.
    • Ledger-style data.
    • Critical configuration and control-plane state.

You pay in latency and availability to get simple, always-correct semantics.
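The latency cost is easy to see in a toy model (illustrative only, not the Cosmos DB implementation): if a write is acknowledged only after every replica has applied it, commit latency is gated by the slowest replica, but a read from any replica is then always fresh.

```python
class StrongReplicaSet:
    """Toy model: a write commits only after ALL replicas apply it."""

    def __init__(self, replica_latencies_ms):
        # Simulated replication latency to each replica (ms).
        self.replica_latencies_ms = replica_latencies_ms
        self.replicas = [{} for _ in replica_latencies_ms]

    def write(self, key, value):
        # Apply to every replica; the commit waits for the slowest one.
        for replica in self.replicas:
            replica[key] = value
        return max(self.replica_latencies_ms)  # commit latency (ms)

    def read(self, key, replica_index=0):
        # Any replica is safe to read: nothing is visible before commit.
        return self.replicas[replica_index].get(key)

rs = StrongReplicaSet([5, 20, 180])         # e.g. local, zonal, cross-region
print(rs.write("balance", 100))             # 180 — gated by the slowest replica
print(rs.read("balance", replica_index=2))  # 100 — always fresh
```

Adding a distant region to the set raises every write's latency, which is exactly the cost described above.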


Bounded staleness

Bounded staleness caps how stale reads can be between regions:

  • Staleness is defined either as:
    • A maximum number of versions K of an item, or
    • A maximum time interval T by which reads may lag writes.
  • Cosmos DB ensures lag between any two regions stays less than your configured K or T.

Single-region write accounts

For a single write region with read regions:

  • Writes occur in the primary region.
  • Replication to secondary regions may lag.
  • With bounded staleness:
    • If lag exceeds K or T, writes are slowed until secondary replicas catch up.
    • This trades some write latency for bounded read freshness across regions.
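The catch-up behavior above can be sketched as a toy model (not the actual Cosmos DB mechanism): the primary tracks a backlog of writes not yet applied on the secondary, and blocks new writes whenever the backlog would exceed the configured K versions.

```python
from collections import deque

class BoundedStalenessPrimary:
    """Toy model: throttle writes so a secondary never lags more than K versions."""

    def __init__(self, k_max_versions):
        self.k = k_max_versions
        self.pending = deque()   # writes not yet applied on the secondary
        self.secondary = {}

    def replicate_one(self):
        # One step of (simulated) asynchronous replication.
        if self.pending:
            key, value = self.pending.popleft()
            self.secondary[key] = value

    def write(self, key, value):
        throttled = False
        while len(self.pending) >= self.k:
            # Lag would exceed K: slow the writer until the secondary catches up.
            throttled = True
            self.replicate_one()
        self.pending.append((key, value))
        return throttled

p = BoundedStalenessPrimary(k_max_versions=2)
print(p.write("a", 1))  # False — lag is 1 version
print(p.write("b", 2))  # False — lag is 2 versions
print(p.write("c", 3))  # True  — had to drain replication before accepting
```

The third write pays extra latency so that readers of the secondary never fall more than K versions behind.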

Multi-region write accounts

With multiple write regions:

  • Writes can originate in any region.
  • Replication happens between all writable regions.
  • Relying on bounded staleness across multiple writers can:
    • Introduce complex dependencies on cross-region replication lag.
    • Violate the expectation that you should generally read from the same region you wrote to.

For most multi-write scenarios, bounded staleness is less attractive than region-local reads plus higher-level conflict resolution.

Why bounded staleness is useful

  • Lets you configure maximum acceptable staleness (K or T).
  • Provides near-strong behavior for global apps where:
    • Users in different regions should see approximately the same view.
    • Some small, controlled delay is acceptable.

Session consistency

Session consistency targets user-centric scenarios while staying highly available.

Guarantees within a single client session:

  • Read-your-writes: if the client writes a value, it can later read it back.
  • Write-follows-reads: writes that depend on earlier reads see a consistent base.

Outside that session:

  • Other clients may see slightly stale data.
  • The system behaves more like eventual/consistent-prefix for them.

Role of session tokens

  • After each write, the server returns a session token that encodes the latest write position (a logical sequence number) for the partition.
  • The client caches this token and sends it on future reads.
  • The server ensures returned data is at least as fresh as indicated by the token; otherwise, it:
    • Routes the read to another replica, or
    • Waits until the replica catches up.

Important details:

  • Tokens are per-partition; a token for partition A does not apply to partition B.
  • Recreating a client resets its token cache; until new writes occur in that session, reads behave like Eventual for that client.
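These mechanics can be sketched as a toy model (illustrative only; real Cosmos DB session tokens are opaque strings, and the names here are hypothetical). The token is modeled as a logical sequence number (LSN) per partition, and a read is only served by a replica that has caught up to the client's token.

```python
class Replica:
    def __init__(self):
        self.lsn = 0          # highest write this replica has applied
        self.data = {}

    def apply(self, lsn, key, value):
        self.data[key] = value
        self.lsn = lsn

class SessionClient:
    """Toy read-your-writes via a cached per-partition token (an LSN here)."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.tokens = {}       # partition -> last-seen LSN

    def on_write(self, partition, lsn):
        self.tokens[partition] = lsn    # server returns the token with the write

    def read(self, partition, key):
        token = self.tokens.get(partition, 0)
        for replica in self.replicas:
            if replica.lsn >= token:    # fresh enough for this session
                return replica.data.get(key)
        return None                     # all replicas behind: would wait/retry

primary, secondary = Replica(), Replica()
primary.apply(1, "cart", ["book"])      # the write has landed on primary only
client = SessionClient([secondary, primary])
client.on_write("p1", 1)
print(client.read("p1", "cart"))        # ['book'] — skips the lagging replica
```

A freshly created client has no token, so any replica (including a stale one) may answer — which is why recreating a client makes reads behave like Eventual until its first write.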

Session consistency is ideal when application behavior is tied to a user session (shopping carts, user profiles, dashboards) and the user must see their own latest actions immediately.


Consistent prefix

Consistent prefix guarantees no out-of-order reads:

  • If writes occur in order w1, w2, w3, a reader may see:
    • [], [w1], [w1, w2], or [w1, w2, w3]
    • But never [w2] or [w1, w3] without w2.
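The rule above is simply a prefix check; a few lines of Python make it concrete (an illustrative checker, not part of any SDK):

```python
def is_consistent_prefix(write_order, observed):
    """True if `observed` is a prefix of `write_order` (possibly empty)."""
    return observed == write_order[:len(observed)]

writes = ["w1", "w2", "w3"]
print(is_consistent_prefix(writes, []))             # True — empty prefix
print(is_consistent_prefix(writes, ["w1", "w2"]))   # True — in-order, stale
print(is_consistent_prefix(writes, ["w1", "w3"]))   # False — skipped w2
```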

Behavior:

  • Single-document writes are eventually consistent, but still appear in correct order.
  • Batch writes within a transaction:
    • Are seen as a unit: all updates from the transaction appear together or not at all.

Replication pattern:

  • Writes are replicated to at least three replicas in the local region.
  • Other regions receive updates asynchronously.

Consistent prefix is a good fit when ordering matters more than absolute freshness, such as:

  • Timelines or feeds.
  • Append-only logs where readers can tolerate some delay but must not see reordered entries.

Eventual consistency

Eventual consistency is the weakest model Cosmos DB offers:

  • If no new writes are made to a data item, eventually all replicas will return the last written value.
  • There is no guarantee about:
    • How long convergence takes.
    • Whether a client might momentarily read older values than it saw previously.

Mechanics:

  • In Cosmos DB, each write is replicated to at least three replicas in the local region (for durability).
  • Replication to other regions is asynchronous.
  • Reads can hit any replica in a region:
    • That replica may be behind and return stale or missing data.
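A toy model of these mechanics (illustrative only, not the Cosmos DB replication protocol): a write lands on one replica, replication to the others is deferred, and a read from a random replica may miss the write until the backlog drains.

```python
import random

class EventualStore:
    """Toy model: write to one replica, replicate lazily, read from any."""

    def __init__(self, n_replicas, seed=42):
        self.replicas = [{} for _ in range(n_replicas)]
        self.backlog = []                 # deferred (replica_index, key, value)
        self.rng = random.Random(seed)

    def write(self, key, value):
        self.replicas[0][key] = value     # lands on one replica first
        for i in range(1, len(self.replicas)):
            self.backlog.append((i, key, value))

    def read(self, key):
        replica = self.rng.choice(self.replicas)  # any replica may answer
        return replica.get(key)

    def converge(self):
        # Drain the replication backlog; all replicas now agree.
        for i, key, value in self.backlog:
            self.replicas[i][key] = value
        self.backlog.clear()

store = EventualStore(3)
store.write("likes", 10)
# Before convergence a read may return 10 or None, depending on the replica hit.
store.converge()
print(store.read("likes"))   # 10 — every replica now agrees
```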

Why use it:

  • Maximizes availability and performance, especially under high load or partial failures.
  • Suitable when minor staleness is acceptable:
    • Social reaction counts.
    • Analytics aggregation.
    • Non-critical telemetry.

Avoid Eventual for flows that cannot tolerate stale or out-of-order reads, such as financial balances or inventory source of truth.


Probabilistically Bounded Staleness (PBS)

Cosmos DB exposes a Probabilistically Bounded Staleness (PBS) metric to quantify “how eventual” your eventual consistency is:

  • PBS answers: with what probability are reads at least as fresh as X, given your workload and topology?
  • Measured over time (ms) and shown per write/read region combination in the Azure portal.

Practical use:

  • You can run at a weaker configured consistency (e.g., Eventual) while:
    • Observing that in practice, you often get Session or Consistent Prefix behavior.
    • Tuning your application expectations accordingly without incurring the full cost of a stronger configured level.

PBS makes consistency behavior observable instead of purely theoretical.
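The idea behind PBS can be sketched with a Monte Carlo estimate (an illustrative calculation, not how Cosmos DB computes the metric): given samples of replication lag, the probability that a read issued some delay after a write observes that write is just the fraction of lags smaller than the delay.

```python
import random

def estimate_freshness(lag_samples_ms, read_delay_ms):
    """P(read sees the write) ~= fraction of lag samples <= the read delay."""
    hits = sum(1 for lag in lag_samples_ms if lag <= read_delay_ms)
    return hits / len(lag_samples_ms)

rng = random.Random(7)
# Hypothetical replication-lag samples (ms), e.g. collected from metrics.
lags = [rng.gauss(50, 20) for _ in range(10_000)]

print(estimate_freshness(lags, 100))  # high — most lag is under 100 ms
print(estimate_freshness(lags, 10))   # low  — reads often race replication
```

Plotting this probability against the read delay is essentially the "how eventual is eventual" curve that PBS surfaces.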


Use cases by consistency level

  Level              Typical use cases
  Strong             Banking and financial transactions; order management where double-spend is unacceptable; any system-of-record requiring globally up-to-date reads.
  Bounded staleness  Global content where near-real-time is enough (news, status dashboards) and you want a clear upper bound on staleness.
  Session            User-centric apps (shopping carts, profiles, social timelines) where a user must see their own writes immediately.
  Consistent prefix  Messaging or event logs where ordering is critical but slight delays are acceptable.
  Eventual           Analytics, counters, social reactions, and non-critical metadata where throughput and availability trump strict freshness.

Architecture view: routing reads and writes with different consistencies

graph TD
    subgraph ClientSide ["Clients and Regions"]
      U1["User A (Region 1)"] -->|"Session consistency"| R1["Region 1 Replica Set"]
      U2["User B (Region 2)"] -->|"Eventual consistency"| R2["Region 2 Replica Set"]
    end

    subgraph CosmosAccount ["Cosmos DB Account"]
      R1 --> P[Primary Partition]
      R2 --> P
    end

    P --> METRICS[("PBS Metrics")]

    style ClientSide fill:#111,stroke:#333,stroke-dasharray: 5 5
    style CosmosAccount fill:#111,stroke:#333,stroke-dasharray: 5 5
    style P fill:#111,stroke:#4ADE80,stroke-width:2px

  • User A in Region 1 uses Session consistency for interactive flows.
  • User B in Region 2 uses Eventual consistency for background analytics.
  • Both hit the same logical partition; the system tracks behavior via PBS metrics.

Key takeaways

  • PACELC frames the tradeoff: on partition, choose A vs C; else, choose L vs C. Cosmos DB’s five levels map these choices directly into the API.
  • Strong gives the simplest programming model but costs in latency and availability.
  • Bounded staleness gives configurable, near-strong behavior for global apps.
  • Session is often the sweet spot for user-facing workloads: strong per user, scalable system-wide.
  • Consistent prefix preserves order without demanding fully fresh data.
  • Eventual maximizes availability and throughput when your domain can tolerate some inconsistency.
