As engineers, we all want the impossible database: infinite write throughput, constant-time reads, and negligible memory usage. The RUM Conjecture formalizes why that system cannot exist on real hardware.
When you design an index or access method, you are trading between three forces:
- Read Overhead (`R`) – extra work per lookup.
- Update Overhead (`U`) – extra work per insert, update, or delete.
- Memory Overhead (`M`) – extra space for indexes, pointers, and metadata.
The conjecture is blunt: you can strictly optimize at most two of R, U, and M; the third will be expensive.
## 1. The RUM Conjecture in Practice
The RUM Conjecture comes from work at the Harvard Data Systems Laboratory (DASlab) on indexing and storage engines (Athanassoulis et al., “Designing Access Methods: The RUM Conjecture,” EDBT 2016). It captures a constraint imposed by hardware realities:
- Disks and SSDs have finite bandwidth and non-zero seek costs.
- CPUs and caches have limited capacity and latency per access.
- DRAM is expensive relative to disk and must be used carefully.
Any index that appears to beat these constraints is simply hiding a cost in one corner of the R/U/M triangle.
At a high level:
You can minimize:
- Read Overhead (`R`)
- Update Overhead (`U`)
- Memory Overhead (`M`)
but not all three simultaneously.
## 2. RUM Tradeoffs: B-Trees, LSM-Trees, and Hash Indexes
To make the conjecture concrete, map it to the engines you already run in production.
### 2.1 R + M optimized: B-Trees (writes pay the price)
Relational databases such as MySQL and PostgreSQL default to B-Tree indexes for primary and secondary keys.
- **Reads (`R`)**
  - B-Trees minimize the height of the tree.
  - Lookups require `O(log n)` page reads, which is small when pages are large and highly packed.
- **Memory (`M`)**
  - Pages store keys in sorted order with tight packing.
  - Metadata overhead per record is low; cache lines are used efficiently.
- **Updates (`U`)**
  - Random inserts require:
    - Traversing the tree from root to leaf.
    - Potential page splits when a node is full.
    - Cascading updates to parent nodes.
  - This makes writes expensive and fragmented, especially with random keys.
RUM corner: B-Trees are effectively R + M optimized, while U is expensive.
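The tradeoff is easy to see in a toy model. The sketch below is illustrative only (leaf pages as tiny sorted lists, no inner nodes): a lookup is a single binary search, while random inserts repeatedly trigger page splits, which is exactly the `U` cost B-Trees pay.

```python
import bisect
import random

PAGE_SIZE = 4  # tiny pages so splits happen quickly

class ToyBTreeLeaves:
    """Leaf level of a B-Tree only: a list of sorted pages.
    An illustrative sketch, not a real B-Tree."""

    def __init__(self):
        self.pages = [[]]
        self.splits = 0

    def _page_for(self, key):
        # First page whose largest key >= key; otherwise the last page.
        for i, p in enumerate(self.pages):
            if p and p[-1] >= key:
                return i
        return len(self.pages) - 1

    def insert(self, key):
        i = self._page_for(key)
        bisect.insort(self.pages[i], key)       # keep the page sorted (U cost)
        if len(self.pages[i]) > PAGE_SIZE:      # full page -> split (more U cost)
            page = self.pages[i]
            mid = len(page) // 2
            self.pages[i:i + 1] = [page[:mid], page[mid:]]
            self.splits += 1

    def lookup(self, key):
        # Binary search inside one page: the cheap R path.
        page = self.pages[self._page_for(key)]
        j = bisect.bisect_left(page, key)
        return j < len(page) and page[j] == key

random.seed(0)
tree = ToyBTreeLeaves()
for k in random.sample(range(1_000), 200):
    tree.insert(k)
print("pages:", len(tree.pages), "splits:", tree.splits)
```

Every split here stands in for the page allocation, copying, and parent updates a real engine performs on disk.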
### 2.2 U + M optimized: LSM-Trees (reads pay the price)
Write-heavy systems such as Cassandra and RocksDB use Log-Structured Merge Trees (LSM-Trees).
- **Updates (`U`)**
  - Writes append to an in-memory memtable, then flush to disk as immutable SSTables.
  - Disk I/O is sequential, which is close to optimal for SSDs and HDDs.
- **Memory (`M`)**
  - Data is stored in sorted runs with strong compression.
  - Bloom filters and sparse indexes reduce the in-memory footprint per record.
- **Reads (`R`)**
  - A single key lookup may require:
    - Checking the memtable.
    - Consulting Bloom filters for multiple SSTables.
    - Searching several on-disk files to find the latest version.
  - Compaction can reduce this cost, but at the expense of background write I/O.
RUM corner: LSM-Trees are U + M optimized, while R is expensive.
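A few lines of Python make the read penalty concrete. This is a deliberately stripped-down sketch (no Bloom filters, no compaction, plain dicts standing in for SSTables): writes are cheap in-memory inserts plus sequential flushes, but a read may have to probe the memtable and then every run, newest to oldest.

```python
MEMTABLE_LIMIT = 4  # tiny so flushes happen quickly

class ToyLSM:
    """Memtable + immutable runs. No Bloom filters, no compaction."""

    def __init__(self):
        self.memtable = {}
        self.runs = []  # each flush appends one immutable sorted run

    def put(self, key, value):
        self.memtable[key] = value                # cheap in-memory write (U)
        if len(self.memtable) >= MEMTABLE_LIMIT:  # flush sequentially to "disk"
            self.runs.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        """Return (value, structures_probed): the R cost grows with run count."""
        probes = 1
        if key in self.memtable:                  # newest data first
            return self.memtable[key], probes
        for run in reversed(self.runs):           # then runs, newest to oldest
            probes += 1
            if key in run:
                return run[key], probes
        return None, probes

lsm = ToyLSM()
for i in range(10):
    lsm.put(f"k{i}", i)
print(lsm.get("k9"))  # still in the memtable: 1 probe
print(lsm.get("k0"))  # in the oldest run: 3 probes
```

Ten writes produce two flushed runs; reading the oldest key probes three structures, while a key still in the memtable costs one. Compaction exists precisely to cap that probe count.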
### 2.3 R + U optimized: Hash indexes (memory pays the price)
If you need instant reads and writes for point lookups, you can use a hash index or a direct-addressing table.
- **Reads (`R`)**
  - Lookup is `O(1)` on average: compute the hash, jump to the bucket, compare a few entries.
- **Updates (`U`)**
  - Inserts and updates are also `O(1)` on average, with small constant factors.
- **Memory (`M`)**
  - To keep collision rates low, you must over-allocate buckets.
  - The structures are not naturally compressible and often require contiguous RAM.
  - Hash tables are useless for range scans, so you may need additional indexes, further increasing `M`.
RUM corner: Hash-based structures are R + U optimized, while M is expensive.
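A rough way to see the `M` cost on your own machine: compare the container overhead of Python's built-in hash table against a packed array over the same keys. `sys.getsizeof` measures only the container, not the key objects, which is exactly the index-overhead part of `M`.

```python
import sys

n = 100_000
keys = list(range(n))                 # packed, contiguous array of references
hash_index = {k: None for k in keys}  # hash table over the same keys

# The dict over-allocates buckets to keep collisions rare;
# the list stores references contiguously with no slack.
print(f"dict: {sys.getsizeof(hash_index) / n:.1f} bytes/entry")
print(f"list: {sys.getsizeof(keys) / n:.1f} bytes/entry")
```

The exact numbers vary by interpreter version, but the hash table's per-entry overhead is reliably a multiple of the packed array's: that slack is the price of `O(1)` lookups.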
### 2.4 RUM comparison table
| Structure | Read Overhead (R) | Update Overhead (U) | Memory Overhead (M) | Typical systems |
|---|---|---|---|---|
| B-Tree | Low | High | Low | MySQL, PostgreSQL |
| LSM-Tree | High | Low | Low | Cassandra, RocksDB |
| Hash index | Low | Low | High | In-memory caches, key–value stores |
You are not choosing a good or bad engine; you are choosing where to pay.
## 3. Visualizing the RUM Triangle
The RUM Conjecture is easiest to reason about as a triangle: each access method sits near one edge, emphasizing two dimensions and compromising on the third.
```mermaid
graph TD
    R["Low Read Overhead (R)"]
    U["Low Update Overhead (U)"]
    M["Low Memory Overhead (M)"]
    R --- U
    U --- M
    M --- R
    BT["B-Tree<br/>(MySQL, Postgres)"] --> R
    BT --> M
    LSM["LSM-Tree<br/>(Cassandra, RocksDB)"] --> U
    LSM --> M
    HASH["Hash Index"] --> R
    HASH --> U
    %% Styling to match the Brutalist dark theme
    style R fill:#111,stroke:#4ADE80,stroke-width:2px,color:#fff
    style U fill:#111,stroke:#4ADE80,stroke-width:2px,color:#fff
    style M fill:#111,stroke:#4ADE80,stroke-width:2px,color:#fff
    style BT fill:#111,stroke:#60A5FA,stroke-width:1px,color:#fff
    style LSM fill:#111,stroke:#60A5FA,stroke-width:1px,color:#fff
    style HASH fill:#111,stroke:#60A5FA,stroke-width:1px,color:#fff
```
If a vendor claims they are “optimal” at all three corners, the diagram tells you what to ask next: Where is the hidden cost?
## 4. Engineering Takeaways
### 4.1 Stop asking “Is this database good?”
The RUM Conjecture reframes the question. Instead of:
> “Is database `X` good or bad?”

Ask:

> “Which dimension is `X` optimizing, and what am I willing to pay for the other two?”
Examples:
- **Analytics / OLAP**
  - Prioritize reads and compression (`R` + `M`).
  - Accept slower, batched writes (`U`).
- **Write-heavy telemetry / logging**
  - Prioritize write throughput and compression (`U` + `M`).
  - Accept slower, more complex reads (`R`).
- **Low-latency key–value cache**
  - Prioritize reads and writes (`R` + `U`).
  - Accept high memory bills (`M`).
### 4.2 Design with explicit RUM budgets
When designing a storage subsystem or selecting a database:
- **Quantify your workload**
  - Target reads/sec, writes/sec, and acceptable p99 latencies.
  - Estimate data volume and growth rate in GB or TB.
- **Allocate a RUM budget**
  - Decide which dimension you are willing to make expensive:
    - Higher read latency (`R`)
    - Higher write amplification (`U`)
    - Higher memory / storage cost (`M`)
- **Choose structures accordingly**
  - B-Trees for mixed workloads with strong point and range query needs.
  - LSM-Trees when ingest rate dominates and you can tolerate heavier read paths.
  - Hash indexes when single-key latency is everything and memory is abundant.
### 4.3 Use RUM as a vendor litmus test
The next time you see claims like:
> “Sub-millisecond reads and writes at petabyte scale, with minimal resource usage.”
Run the RUM checklist:
- If `R` is low and `U` is low, `M` must be high (or hidden in a separate tier).
- If `R` is low and `M` is low, `U` is paying via compaction, page splits, or background jobs.
- If `U` is low and `M` is low, `R` is paying with multi-level lookups and higher tail latency.
Physics guarantees that somewhere, the cost shows up.
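The checklist reads naturally as a lookup table. Below is a hypothetical helper, not a real tool, just the conjecture encoded as data: given the two dimensions a pitch claims to have minimized, it names where the cost must surface.

```python
# The RUM checklist as data: claimed-low pair -> where the cost hides.
HIDDEN_COST = {
    frozenset({"R", "U"}): "M: over-allocated buckets or a RAM-heavy tier",
    frozenset({"R", "M"}): "U: compaction, page splits, or background jobs",
    frozenset({"U", "M"}): "R: multi-level lookups and higher tail latency",
}

def hidden_cost(claimed_low):
    """Given the two dimensions claimed cheap, return the third's bill."""
    return HIDDEN_COST[frozenset(claimed_low)]

print(hidden_cost({"R", "U"}))  # the hash-index corner pays in memory
```

Using `frozenset` keys makes the order of the claimed dimensions irrelevant, which matches the symmetry of the triangle.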
## 5. Why the RUM Conjecture Matters
The RUM Conjecture is not just an academic curiosity. It is a design tool:
- It explains why your existing systems behave the way they do under load.
- It guides which storage engine to use per workload instead of defaulting to one database for everything.
- It arms you with the right questions when reviewing proposals, whitepapers, or vendor pitches.
You cannot beat RUM, but you can choose your corner deliberately. The systems that age well are the ones whose engineers knew exactly which trade-offs they were making—and why.
If this research saved you time or improved your architecture, consider sponsoring my work on GitHub. All sponsorships go directly toward infrastructure and further technical research.