Every engineering team eventually has the same meeting. The monthly OpenAI or Anthropic API bill arrives. The CFO reacts. A senior engineer says: we should host Llama-3 ourselves. The model is free. We rent a few GPUs and cut costs by ~80%.
That sounds brilliant on a whiteboard. In production it is often what I call expensive cosplay: pretending to be an AI research lab without the physical realities of silicon, memory bandwidth, and 24/7 hardware operations.
Strip the hype and look at the mechanical sympathy required to serve a 70B parameter model in production.
VRAM illusion and the KV cache
The first mistake teams make is calculating hardware needs only from model weight size.
Llama-3 70B loaded in 16-bit precision requires roughly 140 GB of VRAM just to sit idle. A standard NVIDIA A100 has 80 GB of VRAM. You rent two A100s, span the model with tensor parallelism, and assume you are ready for production.
You are forgetting the KV cache.
When an LLM generates text, it is autoregressive: it predicts the next token from all previous tokens. To avoid recalculating the entire prompt for every new token, the GPU stores representations of prior tokens in a Key-Value (KV) cache.
This cache lives in physical VRAM. Facts that matter in production:
- Each concurrent user you add requires a dedicated KV block.
- Large context windows plus high concurrency make the KV cache grow; in aggressive setups it can exceed the size of the model weights themselves.
- When VRAM fills, you trigger CUDA out-of-memory (OOM) and the server can crash outright.
| What fills VRAM | Role |
|---|---|
| Weights (fp16, 70B) | ~140 GB baseline just to hold the parameters |
| KV cache (per user / context) | Grows with sequence length and number of parallel users |
| Activations / batching | Additional pressure under dynamic batching |
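To make the table concrete, here is a minimal sizing sketch. The layer count, KV-head count, and head dimension are assumptions matching the commonly published Llama-3 70B configuration, not figures from this article; swap in your model's real config.

```python
# Back-of-envelope VRAM sizing for a 70B-class model served in fp16.
# Architecture numbers are assumptions (commonly published Llama-3 70B
# config: 80 layers, 8 KV heads via GQA, head dim 128) -- adjust for
# your actual model.

BYTES_PER_VALUE = 2        # fp16 / bf16
N_PARAMS        = 70e9     # 70B parameters
N_LAYERS        = 80       # assumed
N_KV_HEADS      = 8        # assumed (grouped-query attention)
HEAD_DIM        = 128      # assumed

def weight_vram_gb() -> float:
    """VRAM needed just to hold the parameters."""
    return N_PARAMS * BYTES_PER_VALUE / 1e9

def kv_cache_gb(context_tokens: int, concurrent_users: int) -> float:
    """KV cache: a K and a V vector per layer, per token, per sequence."""
    bytes_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return bytes_per_token * context_tokens * concurrent_users / 1e9

print(f"weights:  {weight_vram_gb():.0f} GB")          # ~140 GB
print(f"KV cache: {kv_cache_gb(8192, 32):.0f} GB")     # 8k context, 32 users -> ~86 GB
```

With those assumed numbers, 32 users at an 8k context need roughly 86 GB of KV cache, far more than the ~20 GB left on two 80 GB A100s after the weights are loaded.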
Memory bandwidth wall
The second brutal reality of hosting your own model: LLM inference is rarely compute-bound. It is memory-bandwidth-bound.
To generate a single token, the GPU cannot get away with reading only a small fraction of the model: in the simplified mental model, every decode step must stream the full ~140 GB of weights from High Bandwidth Memory (HBM) through the compute units (real kernels overlap work, but the weight traffic dominates the intuition).
An A100 has physical memory bandwidth on the order of ~2000 GB/s. If the model is 140 GB, physics gives a rough ceiling of 2000 / 140 ≈ 14 full-weight traversals per second per stream in that back-of-envelope model.
So the absolute theoretical ceiling on generation speed for a single user is on the order of 14 tokens per second, before you even account for kernel efficiency, scheduling, and other implementation overhead. To serve 10 users at once, you turn to dynamic batching. Batching raises aggregate throughput, because one weight traversal now produces a token for every sequence in the batch, but it inflates time-to-first-token for the individual user waiting on a batch slot. You end up fighting a constant tension between user experience and hardware utilization.
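The same arithmetic in code, so the ceiling is explicit. These are the article's illustrative numbers, not measurements; real serving stacks add KV-cache traffic, kernel overhead, and compute limits on top.

```python
# Decode-speed ceiling from memory bandwidth alone (back-of-envelope).
# Illustrative numbers from the article, not measured figures.

WEIGHT_GB      = 140.0    # fp16 70B weights
HBM_GB_PER_SEC = 2000.0   # approximate A100 80 GB HBM bandwidth

def single_stream_ceiling_tok_s() -> float:
    # One decode step ~ one full read of the weights in this simplified model.
    return HBM_GB_PER_SEC / WEIGHT_GB

def batched_aggregate_tok_s(batch_size: int) -> float:
    # Batching amortizes the weight read: one traversal now yields a token
    # for every sequence in the batch (ignores KV traffic and compute caps).
    return batch_size * single_stream_ceiling_tok_s()

print(f"single stream ceiling: ~{single_stream_ceiling_tok_s():.0f} tok/s")  # ~14
print(f"batch of 10 aggregate: ~{batched_aggregate_tok_s(10):.0f} tok/s")    # ~143
```

Note that the per-user decode speed does not improve with batching; only the aggregate does, and the wait for a batch slot is exactly where the time-to-first-token spike comes from.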
Weights, KV, and bandwidth pressure
```mermaid
graph TD
    W["Llama-3 70B weights, fp16, ~140 GB"] --> TP["Tensor parallel, e.g. 2x A100 80 GB"]
    TP --> HBM["HBM-bandwidth-bound decode"]
    KV["KV cache per user / context"] --> VRAM["VRAM headroom"]
    VRAM --> OOM{"CUDA OOM?"}
    OOM -->|yes| CRASH["Process or worker crash"]
    HBM --> TPS["Single-stream tok/s capped by ~ weight size / HBM read rate"]
    BATCH["Dynamic batching"] --> TPUT["Higher aggregate throughput"]
    BATCH --> LAT["Higher tail latency for individuals"]
```
3 AM hardware reality
When you use an API, hardware failures are someone else’s problem. When you host your own open-source model, the silicon is your responsibility.
GPUs running at maximum thermal capacity are prone to hardware faults. A single row of bad memory in HBM can silently corrupt model outputs. A degraded PCIe switch between two A100s can make tensor-parallelism synchronization stall and drop throughput toward zero.
When your GPU node kernel panics at 3 AM on a Sunday, your standard web backend engineers may not know how to fix it. You need a dedicated ML infra engineer on pager duty. That single salary can wipe out the API savings you promised the CFO.
Engineering takeaway
Software engineers often treat infrastructure as infinitely elastic. GPU infrastructure is anything but: it is rigid, thermally constrained, and unforgiving.
Before you pivot your architecture to self-hosted open-source models, calculate the true Total Cost of Ownership (TCO):
- You are not only paying the hourly GPU rental.
- You also pay for idle capacity, the throughput you give up to latency targets, the VRAM headroom consumed by the KV cache, and the operational cost of keeping high-performance silicon alive in production; the sketch below makes the arithmetic concrete.
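A minimal break-even sketch, purely to show the shape of the calculation. Every price and throughput number below is a placeholder assumption, not a real quote or benchmark; the point is that self-hosting carries large fixed costs that only pay off at very high, well-utilized volume.

```python
# Illustrative TCO comparison: self-hosted vs. API for the same token volume.
# Every number is a placeholder assumption -- plug in your real quotes.

API_USD_PER_1M_TOKENS = 5.00     # assumed blended API price
GPU_USD_PER_HOUR      = 2.50     # assumed A100 rental price
NUM_GPUS              = 2
ONCALL_USD_PER_MONTH  = 15_000   # assumed share of an ML-infra on-call salary
UTILIZATION           = 0.40     # fraction of hours the cluster is actually serving
TOKENS_PER_SEC_BUSY   = 100      # assumed aggregate throughput while busy

HOURS_PER_MONTH = 24 * 30

def self_hosted_usd_per_month() -> float:
    return NUM_GPUS * GPU_USD_PER_HOUR * HOURS_PER_MONTH + ONCALL_USD_PER_MONTH

def tokens_served_per_month() -> float:
    return TOKENS_PER_SEC_BUSY * UTILIZATION * HOURS_PER_MONTH * 3600

def api_usd_for_same_volume() -> float:
    return tokens_served_per_month() / 1e6 * API_USD_PER_1M_TOKENS

print(f"self-hosted: ${self_hosted_usd_per_month():,.0f}/month")
print(f"API at the same volume: ${api_usd_for_same_volume():,.0f}/month")
```

Under these placeholder numbers the API comes out cheaper by more than an order of magnitude; the crossover only arrives when volume and utilization rise dramatically, which is exactly the point.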
Unless your inference volume is astronomically high, or your privacy constraints are legally binding, stick to the API. Stop cosplaying as an AI lab. Focus on building your actual product.
Sponsorship
If this research saved you time or improved your architecture, consider sponsoring my work on GitHub. All sponsorships go directly toward infrastructure and further technical research.
[ Become a Sponsor ]