In distributed databases, replication is mandatory for availability and low latency. That replication introduces a fundamental tradeoff: how fresh your reads are versus how available and fast the system stays. The PACELC theorem captures this:
- On partition (P), choose Availability vs Consistency (A/C).
- Else (E, normal operation), choose Latency vs Consistency (L/C).
Azure Cosmos DB exposes these tradeoffs explicitly via five consistency levels, from Strong to Eventual, instead of forcing you into a single strong-or-eventual choice.
In microservice architectures, these guarantees often sit behind service-to-service APIs (frequently implemented with gRPC); for transport and contract tradeoffs, see Transitioning from REST to gRPC: System Design and Tradeoffs.
Consistency levels in Azure Cosmos DB
Cosmos DB supports five consistency levels (strongest to weakest):
| Level | Guarantee summary |
|---|---|
| Strong | Linearizable reads; always see the latest committed write. |
| Bounded staleness | Reads lag behind writes by at most K versions or T time. |
| Session | Per-session read-your-writes and write-follows-reads. |
| Consistent prefix | Writes are seen in order; no out-of-order reads. |
| Eventual | Replicas converge eventually; no ordering guarantees. |
The spectrum from Strong → Eventual trades some consistency for higher availability, lower latency, and higher throughput.
Strong consistency
Strong consistency gives linearizability:
- Every read returns the most recent committed write or an error.
- Clients never see partial or uncommitted writes.
- All replicas appear to move forward in a single global order.
Operationally:
- Writes must be replicated and committed across regions before they are visible.
- This increases write latency and can reduce availability during failures (if some replicas cannot commit).
Use Strong when:
- Business logic cannot tolerate stale reads:
  - Financial balances.
  - Ledger-style data.
  - Critical configuration and control-plane state.
You pay in latency and availability to get simple, always-correct semantics.
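The visibility rule can be sketched as a toy model (not the Cosmos DB implementation): a write is acknowledged only after every replica has committed it, so any subsequent read from any replica returns the latest committed value. The `StrongRegister` class and its replica count are illustrative.

```python
# Toy model of strong (linearizable) visibility: a write is acknowledged
# only after ALL replicas commit it, so every read sees the latest value.
class StrongRegister:
    def __init__(self, replica_count=3):
        self.replicas = [None] * replica_count  # committed value per replica

    def write(self, value):
        # Commit to every replica before acknowledging the write.
        # This is the latency/availability cost: the slowest (or an
        # unreachable) replica gates the acknowledgment.
        for i in range(len(self.replicas)):
            self.replicas[i] = value
        return "ack"

    def read(self, replica_index):
        # Any replica returns the most recent committed write.
        return self.replicas[replica_index]

reg = StrongRegister()
reg.write("balance=100")
assert all(reg.read(i) == "balance=100" for i in range(3))
```

The inverse of this property is what weaker levels relax: they acknowledge writes before all replicas agree, which is where staleness comes from.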
Bounded staleness
Bounded staleness caps how stale reads can be between regions:
- Staleness is defined either as:
  - A maximum number of versions K of an item, or
  - A maximum time interval T by which reads may lag writes.
- Cosmos DB ensures the lag between any two regions stays below your configured K or T.
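The K/T rule can be illustrated with a small admission check. This is a hedged sketch, not the service's internal algorithm: the function name and the "either bound suffices" combination are assumptions for illustration.

```python
import time

# Illustrative bounded-staleness check: a replica may serve a read only if
# it lags the write region by at most K versions or T seconds. (How Cosmos
# DB actually combines the two bounds is configuration-dependent; this
# sketch treats satisfying either bound as admissible.)
def within_staleness_bound(primary_version, replica_version,
                           replica_last_sync, max_versions_k, max_seconds_t,
                           now=None):
    now = time.time() if now is None else now
    version_lag = primary_version - replica_version
    time_lag = now - replica_last_sync
    return version_lag <= max_versions_k or time_lag <= max_seconds_t

# Replica is 3 versions behind but synced 2 s ago: fine under T = 5 s.
assert within_staleness_bound(10, 7, 100.0, max_versions_k=2,
                              max_seconds_t=5, now=102.0)
# Replica is 3 versions behind and synced 60 s ago: violates both bounds.
assert not within_staleness_bound(10, 7, 100.0, max_versions_k=2,
                                  max_seconds_t=5, now=160.0)
```

When the check fails, the system's remedy is on the write path, as described below: writes are slowed until replicas catch up.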
Single-region write accounts
For a single write region with read regions:
- Writes occur in the primary region.
- Replication to secondary regions may lag.
- With bounded staleness:
  - If lag exceeds K or T, writes are slowed until secondary replicas catch up.
  - This trades some write latency for bounded read freshness across regions.
Multi-region write accounts
With multiple write regions:
- Writes can originate in any region.
- Replication happens between all writable regions.
- Relying on bounded staleness across multiple writers can:
  - Introduce complex dependencies on cross-region replication lag.
  - Violate the expectation that you should generally read from the same region you wrote to.
For most multi-write scenarios, bounded staleness is less attractive than region-local reads plus higher-level conflict resolution.
Why bounded staleness is useful
- Lets you configure the maximum acceptable staleness (K or T).
- Provides near-strong behavior for global apps where:
  - Users in different regions should see approximately the same view.
  - Some small, controlled delay is acceptable.
Session consistency
Session consistency targets user-centric scenarios while staying highly available.
Guarantees within a single client session:
- Read-your-writes: if the client writes a value, it can later read it back.
- Write-follows-reads: writes that depend on earlier reads see a consistent base.
Outside that session:
- Other clients may see slightly stale data.
- The system behaves more like eventual/consistent-prefix for them.
Role of session tokens
- After each write, the server returns a session token stamped with the latest state for a partition.
- The client caches this token and sends it on future reads.
- The server ensures returned data is at least as fresh as indicated by the token; otherwise, it:
  - Routes the read to another replica, or
  - Waits until the replica catches up.
Important details:
- Tokens are per-partition; a token for partition A does not apply to partition B.
- Recreating a client resets its token cache; until new writes occur in that session, reads behave like Eventual for that client.
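The token flow above can be sketched as a toy model. This is not the Cosmos DB SDK; `Server`, `SessionClient`, and the use of a per-partition LSN as the token are illustrative assumptions.

```python
# Toy sketch of per-partition session tokens: each write bumps a partition's
# log sequence number (LSN); reads carry the client's cached token so the
# server never returns data older than what this session has already seen.
class Server:
    def __init__(self):
        self.lsn = {}    # partition -> latest committed LSN
        self.data = {}   # partition -> value at that LSN

    def write(self, partition, value):
        self.lsn[partition] = self.lsn.get(partition, 0) + 1
        self.data[partition] = value
        return self.lsn[partition]  # returned to the client as its token

    def read(self, partition, min_lsn):
        # A real server would route to another replica or wait until the
        # chosen replica reaches min_lsn; this single-node sketch is always
        # fresh, so the freshness requirement trivially holds.
        assert self.lsn.get(partition, 0) >= min_lsn
        return self.data.get(partition)

class SessionClient:
    def __init__(self, server):
        self.server = server
        self.tokens = {}  # per-partition session tokens (partition -> LSN)

    def write(self, partition, value):
        self.tokens[partition] = self.server.write(partition, value)

    def read(self, partition):
        # Send the cached token; data returned is at least that fresh.
        return self.server.read(partition, self.tokens.get(partition, 0))

server = Server()
client = SessionClient(server)
client.write("cart:A", ["item-1"])
assert client.read("cart:A") == ["item-1"]   # read-your-writes
assert "cart:B" not in client.tokens         # tokens are per-partition
```

Note how a fresh `SessionClient` starts with an empty token cache, which is exactly why a recreated client initially reads like Eventual.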
Session consistency is ideal when application behavior is tied to a user session (shopping carts, user profiles, dashboards) and the user must see their own latest actions immediately.
Consistent prefix
Consistent prefix guarantees no out-of-order reads:
- If writes occur in order w1, w2, w3, a reader may see: [], [w1], [w1, w2], or [w1, w2, w3].
- But never [w2], or [w1, w3] without w2.
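The rule reduces to a simple invariant: every observable read state is a prefix of the global write order. A minimal validator sketch:

```python
# A read is valid under consistent prefix iff the observed sequence is a
# prefix of the global write order.
def is_consistent_prefix(write_order, observed):
    return observed == write_order[:len(observed)]

writes = ["w1", "w2", "w3"]
assert is_consistent_prefix(writes, [])
assert is_consistent_prefix(writes, ["w1", "w2"])
assert not is_consistent_prefix(writes, ["w2"])        # skipped w1
assert not is_consistent_prefix(writes, ["w1", "w3"])  # missing w2
```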
Behavior:
- Single-document writes are eventually consistent, but still appear in correct order.
- Batch writes within a transaction are seen as a unit: all updates from the transaction appear together or not at all.
Replication pattern:
- Writes are replicated to at least three replicas in the local region.
- Other regions receive updates asynchronously.
Consistent prefix is a good fit when ordering matters more than absolute freshness, such as:
- Timelines or feeds.
- Append-only logs where readers can tolerate some delay but must not see reordered entries.
Eventual consistency
Eventual consistency is the weakest model Cosmos DB offers:
- If no new writes are made to a data item, eventually all replicas will return the last written value.
- There is no guarantee about:
  - How long convergence takes.
  - Whether a client might momentarily read older values than it saw previously.
Mechanics:
- In Cosmos DB, each write is replicated to at least three replicas in the local region (for durability).
- Replication to other regions is asynchronous.
- Reads can hit any replica in a region; that replica may be behind and return stale or missing data.
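These mechanics can be sketched as a lagging-replica model. The log, replica names, and lag counts are hypothetical; the point is that a read routed to a behind replica returns an older value until replication converges.

```python
# Toy model: writes land on the primary; each secondary applies the shared
# log asynchronously, so a read routed to a lagging replica is stale.
log = ["v1", "v2", "v3"]  # committed write log on the primary
applied = {"primary": 3, "replica-1": 3, "replica-2": 1}  # entries applied

def read(replica):
    n = applied[replica]
    return log[n - 1] if n else None

assert read("primary") == "v3"
assert read("replica-2") == "v1"  # stale read from a lagging replica

# Convergence: once replication catches up, all replicas agree.
applied["replica-2"] = 3
assert {read(r) for r in applied} == {"v3"}
```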
Why use it:
- Maximizes availability and performance, especially under high load or partial failures.
- Suitable when minor staleness is acceptable:
  - Social reaction counts.
  - Analytics aggregation.
  - Non-critical telemetry.
Avoid Eventual for flows that cannot tolerate stale or out-of-order reads, such as financial balances or inventory source of truth.
Probabilistically Bounded Staleness (PBS)
Cosmos DB exposes a Probabilistically Bounded Staleness (PBS) metric to quantify “how eventual” your eventual consistency is:
- PBS answers: with what probability are reads at least as fresh as X, given your workload and topology?
- Measured over time (in milliseconds) and shown per write/read-region combination in the Azure portal.
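A PBS-style number can be approximated with a Monte Carlo sketch. This is not the portal's actual algorithm; the replication-lag distribution below (mostly fast, with a slow tail) is an assumed model purely for illustration.

```python
import random

# Monte Carlo sketch (not the portal's algorithm): estimate the probability
# that a read observes data no staler than `bound_ms`, under a hypothetical
# replication-lag distribution for one read region.
def estimate_pbs(bound_ms, trials=100_000, seed=42):
    rng = random.Random(seed)
    fresh = 0
    for _ in range(trials):
        # Assumed lag model: ~95% of reads see fast replication (mean 20 ms),
        # the rest hit a slow tail (mean 500 ms).
        if rng.random() < 0.95:
            lag = rng.expovariate(1 / 20)
        else:
            lag = rng.expovariate(1 / 500)
        if lag <= bound_ms:
            fresh += 1
    return fresh / trials

p = estimate_pbs(bound_ms=100)
assert 0.9 < p < 1.0  # most reads are "fresh enough" under this lag model
```

Under this model most reads land well inside a 100 ms bound, which mirrors the observation below: a weakly configured system often behaves much more strongly in practice.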
Practical use:
- You can run at a weaker configured consistency (e.g., Eventual) while:
  - Observing that, in practice, you often get Session or Consistent Prefix behavior.
  - Tuning your application expectations accordingly, without incurring the full cost of a stronger configured level.
PBS makes consistency behavior observable instead of purely theoretical.
Use cases by consistency level
| Level | Typical use cases |
|---|---|
| Strong | Banking and financial transactions; order management where double-spend is unacceptable; any system-of-record requiring globally up-to-date reads. |
| Bounded staleness | Global content where near-real-time is enough (news, status dashboards) and you want a clear upper bound on staleness. |
| Session | User-centric apps (shopping carts, profiles, social timelines) where a user must see their own writes immediately. |
| Consistent prefix | Messaging or event logs where ordering is critical but slight delays are acceptable. |
| Eventual | Analytics, counters, social reactions, and non-critical metadata where throughput and availability trump strict freshness. |
Architecture view: routing reads and writes with different consistencies
```mermaid
graph TD
    subgraph ClientSide ["Clients and Regions"]
        U1["User A (Region 1)"] -->|"Session consistency"| R1["Region 1 Replica Set"]
        U2["User B (Region 2)"] -->|"Eventual consistency"| R2["Region 2 Replica Set"]
    end
    subgraph CosmosAccount ["Cosmos DB Account"]
        R1 --> P["Primary Partition"]
        R2 --> P
    end
    P --> METRICS[("PBS Metrics")]
    style ClientSide fill:#111,stroke:#333,stroke-dasharray: 5 5
    style CosmosAccount fill:#111,stroke:#333,stroke-dasharray: 5 5
    style P fill:#111,stroke:#4ADE80,stroke-width:2px
```
- User A in Region 1 uses Session consistency for interactive flows.
- User B in Region 2 uses Eventual consistency for background analytics.
- Both hit the same logical partition; the system tracks behavior via PBS metrics.
Key takeaways
- PACELC frames the tradeoff: on partition, choose A vs C; else, choose L vs C. Cosmos DB’s five levels map these choices directly into the API.
- Strong gives the simplest programming model but costs in latency and availability.
- Bounded staleness gives configurable, near-strong behavior for global apps.
- Session is often the sweet spot for user-facing workloads: strong per user, scalable system-wide.
- Consistent prefix preserves order without demanding fully fresh data.
- Eventual maximizes availability and throughput when your domain can tolerate some inconsistency.