Under load, the usual playbook is to grow the thread pool. More threads feel like more concurrency and higher throughput. In production, that curve bends: throughput stops climbing, then drops, while latency explodes. The explanation is not in your framework, it is in how the OS multiplexes work onto fixed silicon.
Hardware reality: concurrency ≠ extra cores
Your runtime may create thousands of threads, but the machine still has N physical cores: 16 cores means at most 16 streams of execution at any given instant. That is a hard cap.
With far more runnable threads than cores, the scheduler time-slices: it gives Core 1 to Thread A for a slice, preempts it, and runs Thread B on the same core. That handoff is a context switch. Software often treats it as cheap; on the hardware, it is not.
What a context switch actually does
When the OS swaps Thread A for Thread B, it must save A’s machine state and restore B’s:
- Save program counter, stack pointer, and CPU registers for A (typically to a per-thread kernel structure).
- Load B’s saved registers and resume execution.
Mechanical overhead is often on the order of 1–2 µs, which is still thousands of instructions' worth of time that could have done useful work.
The deeper cost is not register shuffling alone.
Cache pollution: why throughput collapses
Modern CPUs depend on L1 and L2 to keep hot data next to the core. Rough orders of magnitude:
| Source | Typical latency (order of magnitude) |
|---|---|
| L1 | ~1 ns |
| L2 | ~3–10 ns |
| Main RAM | ~100 ns |
While Thread A runs, it warms the caches with A's working set. After a switch, Thread B finds the hierarchy full of A's lines, which are useless to B's working set. B takes cache misses, stalls, and refills from RAM, paying on the order of 100+ ns per critical access.
If slices are short relative to refill cost, B barely amortizes work before the next preempt. A returns and invalidates the picture again. The core spends most of its time on bookkeeping and memory traffic, not domain logic. That pattern is cache thrashing, a common reason more threads reduce throughput.
Mermaid: many threads vs thread-per-core
Two separate flows: problem first, pattern second, so each diagram stacks vertically instead of competing for horizontal space in one canvas.
Performance trap: thread churn
```mermaid
graph LR
    TP[Large thread pool] --> SCHED[OS scheduler: timeslice & preempt]
    SCHED --> CS[Context switch & cache churn]
    CS --> CORES[CPU physical cores]
```
High throughput: pinned workers
```mermaid
graph TD
    A[One worker thread per core] --> B[CPU affinity or pinning]
    B --> C[CPU physical cores]
    A --> D[Non-blocking I/O: epoll / io_uring]
    D -. no block on wire .-> A
```
Thread-per-core: stop paying for switches you do not need
If preemption on a core is the tax, the design response is: do not schedule competing threads on the same core for that workload.
Thread-per-core means sizing runnable workers to the physical core count (e.g., 16 cores → 16 threads for that pipeline) and using CPU pinning (affinity) so Thread i stays on Core i. The scheduler is not constantly evicting unrelated workloads from that core's slot, so caches stay hot and the pipeline sees fewer miss-driven stalls.
This is mechanical sympathy: match software parallelism to hardware parallelism.
I/O without blocking the core
If a worker blocks on a socket or DB round-trip, that core sits idle while the world waits, which is costly when your budget is one thread per core.
Thread-per-core stacks pair with non-blocking, asynchronous I/O:
- Linux: epoll, io_uring, or similar event interfaces.
- Pattern: submit work, register a completion, and continue serving other requests instead of parking the thread.
When Thread 1 issues a DB call, it does not sit in a syscall that holds the core for nothing; it queues the operation and moves on to the next ready work. When data arrives, the runtime resumes the continuation.
That combination of few threads, pinned cores, and async I/O is how systems like Redis, Nginx, and ScyllaDB sustain very high ops/sec on modest boxes: they respect the core count and minimize scheduler-induced cache churn.
Engineering takeaway
- Concurrency in software is not the same as parallel execution on distinct cores.
- For CPU-bound work, more threads than cores invites timeslicing, context switches, and cache thrashing.
- Size pools to hardware, pin where it matters, and use async I/O so threads do not block cores on I/O wait.
Stop tuning thread pools by instinct alone. Map load to cores, measure scheduler and cache effects, and design so the machine does not fight your abstractions.