Clock synchronization is a nightmare: Why Spanner uses TrueTime and the rest of us suffer

I keep reviewing distributed database architectures where engineers rely on system timestamps to resolve write conflicts. They pull up System.currentTimeMillis(), attach it to a payload, and assume they have a globally accurate source of truth.

I genuinely hate seeing this pattern. It shows a complete disconnect from the physical reality of how servers track time.

When you architect a high-throughput retail backend or a multi-region AI platform, relying on system clocks for data synchronization is a recipe for silent data corruption.

The Lie of the System Clock

Let us look at the hardware. Your server motherboard tracks time using a quartz crystal oscillator. These crystals are physical objects. They react to their environment. If the server rack gets hot, the crystal vibrates at a slightly different frequency.

Over a few hours, the clock on Server A will drift away from the clock on Server B.
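To put rough numbers on that drift, here is a back-of-the-envelope sketch. The drift rates are assumptions I picked for illustration (tens of parts per million is a commonly quoted ballpark for commodity quartz); your hardware will differ, and `drift_ms` is just a helper name for this example.

```python
def drift_ms(rate_ppm: float, elapsed_s: float) -> float:
    """Accumulated clock error in milliseconds for a constant drift rate.

    1 ppm means the clock gains or loses 1 microsecond every second.
    """
    return rate_ppm * elapsed_s / 1000.0


# ASSUMED, illustrative drift rates. Real crystals vary with temperature and age.
SERVER_A_PPM = 15.0   # runs fast
SERVER_B_PPM = -10.0  # runs slow

for hours in (1, 6, 24):
    elapsed_s = hours * 3600
    gap_ms = drift_ms(SERVER_A_PPM, elapsed_s) - drift_ms(SERVER_B_PPM, elapsed_s)
    print(f"after {hours:>2} h without correction: clocks are {gap_ms:.0f} ms apart")
```

Even with modest assumed rates, two uncorrected servers end up tens of milliseconds apart within an hour, which is far more than the gap between two quick user actions.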

Engineers try to fix this with the Network Time Protocol. NTP periodically pings an external time server to correct the local clock. But NTP packets have to travel over the network. The network drops packets. Switches queue traffic. Latency spikes.

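You cannot measure absolute time over a variable latency network. NTP estimates the one-way delay as half of the measured round trip, so any asymmetry between the outbound and return paths, plus queueing jitter, turns directly into clock error. The worst-case error is half the round trip, so if Server A has a 5 millisecond round trip to the NTP server and Server B has a 50 millisecond one, their clocks will never perfectly align.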
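The standard NTP offset calculation makes the problem concrete. The sketch below uses the usual four-timestamp formula; the `simulate` helper and its delay values are assumptions for illustration, not measurements from a real network.

```python
def ntp_estimate(t0: float, t1: float, t2: float, t3: float) -> tuple[float, float]:
    """Standard NTP offset and round-trip delay from four timestamps.

    t0: client sends request   (client clock)
    t1: server receives it     (server clock)
    t2: server sends reply     (server clock)
    t3: client receives reply  (client clock)
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2
    delay = (t3 - t0) - (t2 - t1)
    return offset, delay


def simulate(true_offset_ms: float, outbound_ms: float, inbound_ms: float) -> tuple[float, float]:
    """Hypothetical exchange where the server clock is true_offset_ms ahead of the client."""
    t0 = 1000.0                            # client clock at send
    t1 = t0 + outbound_ms + true_offset_ms  # server clock at arrival
    t2 = t1                                # assume the server replies instantly
    t3 = t0 + outbound_ms + inbound_ms      # client clock at reply
    return ntp_estimate(t0, t1, t2, t3)


# Symmetric path: the estimate recovers the true 7 ms offset.
print(simulate(true_offset_ms=7.0, outbound_ms=25.0, inbound_ms=25.0))

# Asymmetric path with the same round trip: the estimate is off by (45 - 5) / 2 = 20 ms.
print(simulate(true_offset_ms=7.0, outbound_ms=45.0, inbound_ms=5.0))
```

Both exchanges report the same 50 ms round trip, so the client has no way to tell the good estimate from the bad one.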

Silent Data Loss and Distributed Transactions

This hardware reality becomes a software engineering nightmare when you build distributed systems.

Many distributed databases, particularly eventually consistent ones, use a Last Write Wins policy to resolve conflicts. If two nodes receive updates for the same record, the database looks at the timestamps and discards the older one.

Imagine your cloud environment has a minor clock skew. Node A is 10 milliseconds ahead of Node B.

A user updates their profile picture on Node A.

Two milliseconds later, the user updates their password on Node B.

These are sequential actions, and the password update happened last. But because Node A’s clock is artificially in the future, its write carries a timestamp 8 milliseconds later than Node B’s. When the replicas sync the record, the database assumes the password update is obsolete and drops it.

You just lost user data. The database did not throw an error. The logs look completely normal. I have spent miserable nights debugging this exact race condition in distributed retail backends.
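Here is a minimal sketch of that failure mode: two sequential writes to the same record, where the earlier one lands on the node with the fast clock. The `Node`, `Write`, and `last_write_wins` names are hypothetical and not taken from any particular database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    value: str
    timestamp_ms: int  # wall-clock reading of the node that accepted the write


class Node:
    def __init__(self, name: str, skew_ms: int) -> None:
        self.name = name
        self.skew_ms = skew_ms  # how far this node's clock is ahead of true time

    def accept(self, value: str, true_time_ms: int) -> Write:
        # The node stamps the write with its own, skewed clock.
        return Write(value, true_time_ms + self.skew_ms)


def last_write_wins(a: Write, b: Write) -> Write:
    """Keep whichever write carries the larger timestamp."""
    return a if a.timestamp_ms >= b.timestamp_ms else b


node_a = Node("A", skew_ms=10)  # 10 ms ahead
node_b = Node("B", skew_ms=0)

first = node_a.accept("record v1", true_time_ms=1_000)
second = node_b.accept("record v2", true_time_ms=1_002)  # really happened later

winner = last_write_wins(first, second)
print(f"kept: {winner.value!r}, stamped {winner.timestamp_ms}")
```

The merge keeps "record v1" because 1010 beats 1002, and the newer write disappears without an error or a log line.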

Cheating Physics with TrueTime

Google realized they could not solve clock synchronization with software. They decided to solve it with hardware.

When they built Spanner, their globally distributed database, they put GPS receivers and atomic clocks directly into their data center racks. They created an API called TrueTime.

TrueTime does not return a single timestamp. It returns an uncertainty interval, [earliest, latest], that bounds how far the local clock may be from the true time. The API guarantees the actual time falls somewhere between earliest and latest.

Google then did something brilliant with this uncertainty.

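Before Spanner commits a transaction, it looks at the TrueTime interval. It picks a commit timestamp at the upper bound, latest, then holds the commit until TrueTime reports that this timestamp has definitely passed. If the uncertainty window is 4 milliseconds, the database thread literally pauses execution for roughly 4 milliseconds. It waits out the uncertainty.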

By the time the commit finishes, Spanner mathematically guarantees that the transaction is entirely in the past for every other node on the planet. They trade a tiny bit of write latency for absolute, global consistency.

sequenceDiagram
  participant C as Client
  participant B as Node B
  participant A as Node A
  participant TT as TrueTime

  C->>B: write X
  B->>TT: request interval
  TT-->>B: [earliest, latest]
  B->>B: choose commit timestamp s = latest
  loop commit wait until earliest > s
    B->>TT: request interval
    TT-->>B: [earliest, latest]
  end
  B->>A: replicate X with timestamp s
  A->>A: apply with s
  B-->>C: commit acknowledged
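The commit wait described in this section can be sketched in a few lines. This is a simulation under stated assumptions, not Google's implementation: `SimulatedTrueTime` is a hypothetical stand-in that wraps the local clock in an assumed error bound `epsilon_s`, and it only mirrors the shape of the TrueTime interface (an interval from `now()` and an `after()` check).

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Interval:
    earliest: float
    latest: float


class SimulatedTrueTime:
    """Hypothetical stand-in: local clock plus an ASSUMED error bound epsilon_s."""

    def __init__(self, epsilon_s: float) -> None:
        self.epsilon_s = epsilon_s

    def now(self) -> Interval:
        t = time.monotonic()  # local reference clock for the simulation
        return Interval(t - self.epsilon_s, t + self.epsilon_s)

    def after(self, t: float) -> bool:
        """True only once t has definitely passed, even in the worst case."""
        return self.now().earliest > t


def commit(tt: SimulatedTrueTime, apply_write: Callable[[float], None]) -> float:
    s = tt.now().latest       # commit timestamp at the top of the interval
    while not tt.after(s):    # commit wait: sit out the uncertainty
        time.sleep(0.0005)
    apply_write(s)            # only now does the write become visible
    return s


applied: list[float] = []
tt = SimulatedTrueTime(epsilon_s=0.002)  # interval width of 4 ms

started = time.monotonic()
commit_ts = commit(tt, applied.append)
waited_ms = (time.monotonic() - started) * 1000
print(f"commit held for {waited_ms:.1f} ms before becoming visible")
```

The wait is about the width of the interval, so the cost of every write scales directly with how tight the clock bounds are. That is why the hardware matters.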

What the Rest of Us Do

Most of us do not run our own GPS receivers and atomic clocks, and we do not get a TrueTime API in our AWS or Azure VPCs. We have to suffer.

If you are building a system that requires strict consistency across multiple regions, you have to work around the unreliable clocks.

You can use logical clocks like Lamport timestamps. These track causality and event order instead of absolute wall clock time.
You can force all writes through a single leader node. You pay a massive cross-region latency tax, but you guarantee a single source of truth.
You can write complex application logic to merge conflicting data structures using CRDTs.
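Of the options above, the logical clock is the easiest to show in code. This is a minimal Lamport clock sketch; the class and method names are my own, and it assumes each node passes its counter along with every message it sends.

```python
class LamportClock:
    """Logical clock: orders events by causality, not by wall-clock time."""

    def __init__(self) -> None:
        self.counter = 0

    def tick(self) -> int:
        """Call for every local event, including sending a message."""
        self.counter += 1
        return self.counter

    def receive(self, remote_counter: int) -> int:
        """Call when a message arrives carrying the sender's counter."""
        self.counter = max(self.counter, remote_counter) + 1
        return self.counter


node_a = LamportClock()
node_b = LamportClock()

# First write lands on Node B.
first_ts = node_b.tick()

# Node A hears about that write (replication, or the client passing it along)
# before accepting the second write.
node_a.receive(first_ts)
second_ts = node_a.tick()

print(first_ts, second_ts)
assert first_ts < second_ts  # order holds no matter what the wall clocks say
```

The catch is that the ordering only covers events connected by a message. Two writes that never hear about each other are concurrent, and you still need a tie-break rule for them.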

Do not blindly trust system clocks in a distributed environment. The hardware drifts. The network lies. If you rely on NTP to keep your distributed database perfectly synced, you will eventually overwrite valid data.
