For roughly two decades, backend infrastructure has been tuned for a specific client: humans. A click on a dashboard leads to a quick checkout from the pool, a few optimized queries, return of the connection, and a JSON or HTML response. A typical interaction might land near 50 ms of database time.
That speed is what makes connection pooling viable. Middleware like PgBouncer or HikariCP assumes an implicit contract: transactions stay microscopic.
AI agents wired straight into data stores do not behave like humans. They run autonomous loops and, in doing so, stress traditional database architecture in ways pool sizing never assumed.
What a database connection costs
In PostgreSQL, each connection maps to a heavy OS-level process. Spawning one consumes on the order of 10 MB of RAM, so 5,000 direct connections can mean roughly 50 GB spent on process overhead, with little left for the data cache.
Connection pools cap the number of live backends at a small, fixed size, for example 100 connections. Because human queries are short (near 50 ms), those 100 slots turn over fast enough to support high request rates. High turnover makes the math work.
| Assumption | Typical human path | Naive agent path |
|---|---|---|
| Time holding a pooled connection | ~50 ms | 3-5 s+ while waiting on LLM I/O |
| What the slot is doing | Query work | Often idle on network wait |
| Effect on pool | Reuse | Exhaustion at modest agent concurrency |
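To make the table concrete, here is a back-of-the-envelope Little's law calculation. The numbers are the illustrative figures from the table, not measurements:

```python
# Little's law: max throughput = pool slots / time each request holds a slot.
POOL_SLOTS = 100

human_hold_s = 0.050  # ~50 ms of real query work per checkout
agent_hold_s = 5.0    # slot parked across a 3-5 s LLM round trip

human_qps = POOL_SLOTS / human_hold_s  # 2,000 requests/s
agent_qps = POOL_SLOTS / agent_hold_s  # 20 requests/s

print(f"human-style traffic: {human_qps:,.0f} req/s on {POOL_SLOTS} slots")
print(f"naive agent traffic: {agent_qps:,.0f} req/s on {POOL_SLOTS} slots")
# Same pool, ~100x less capacity: the slots are waiting, not working.
```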
The agentic ReAct loop
Agents often follow ReAct (Reason and Act):
- The backend receives a user prompt.
- The LLM proposes SQL (or a plan) to inspect data.
- The backend runs the SQL.
- Raw rows go back to the LLM, which summarizes or chooses the next step.
The failure mode: implementations open a transaction, run the query, then keep the connection checked out while the LLM processes the result.
LLM inference over the network is slow relative to a local query. A call to GPT-4 or Claude can sit in the 3-5 s range.
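A minimal sketch of that failure mode, assuming asyncpg as the PostgreSQL driver; `llm_propose_sql` and `llm_summarize` are hypothetical stand-ins for the inference round trips:

```python
import asyncio
import asyncpg  # assumed driver; any pool-backed client has the same shape

async def llm_propose_sql(question: str) -> str:
    await asyncio.sleep(4)  # stand-in for a 3-5 s inference call
    return "SELECT 1"       # hypothetical model output

async def llm_summarize(rows) -> str:
    await asyncio.sleep(4)  # second network wait
    return f"summary of {len(rows)} rows"

async def answer_naively(pool: asyncpg.Pool, question: str) -> str:
    # ANTI-PATTERN: the pooled slot stays checked out across both LLM waits.
    async with pool.acquire() as conn:
        sql = await llm_propose_sql(question)  # seconds of network wait
        rows = await conn.fetch(sql)           # milliseconds of real DB work
        answer = await llm_summarize(rows)     # more seconds of network wait
    return answer  # the slot is released here, ~8 s after acquire
```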
Network I/O and pool exhaustion
Holding a pooled connection during a ~5 s OpenAI-style round trip breaks the microscopic transaction assumption.
Instead of 50 ms occupancy, a slot may sit for 5,000 ms, idle, waiting on GPU-backed inference elsewhere.
With a pool of 100, 100 concurrent agents thinking can occupy every slot. Request 101, perhaps a human loading the homepage, queues. Latency jumps; the app can fail under timeout pressure.
CPU on the database might read 2%, yet the system is blocked because connection slots are tied to network waits, not query work.
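The exhaustion is easy to reproduce without a real database. A toy asyncio model, with a semaphore standing in for the pool and illustrative sleep times:

```python
import asyncio
import time

POOL = asyncio.Semaphore(100)  # stand-in for a 100-slot connection pool

async def agent_request() -> None:
    async with POOL:
        await asyncio.sleep(5.0)  # slot held during a fake LLM round trip

async def human_request() -> None:
    t0 = time.monotonic()
    async with POOL:
        await asyncio.sleep(0.05)  # ~50 ms of "query" work
    print(f"human request took {time.monotonic() - t0:.2f}s")

async def main() -> None:
    agents = [asyncio.create_task(agent_request()) for _ in range(100)]
    await asyncio.sleep(0.1)  # let the agents occupy all 100 slots
    await human_request()     # request 101: queues ~5 s behind them
    await asyncio.gather(*agents)

asyncio.run(main())  # prints roughly: human request took 5.0s
```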
Buffer cache destruction
Second-order effect: cache eviction.
Humans tend to hit predictable, indexed paths. The database keeps hot pages in RAM (OS page cache, buffer cache). RAM reads sit in the nanosecond range.
Agents generating ad hoc SQL are less predictable. An agent answering something like seasonal merchandise trends might issue an unselective query that scans tens of millions of log rows.
That pulls cold data from SSD at scale. Limited RAM forces eviction of hot user data to make room for the scan working set.
When normal traffic returns, the warm pages may be gone. Everyday queries fall back to slower I/O: the agent workload has destroyed cache locality for the rest of the system.
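One way to observe the eviction is the buffer cache hit ratio from pg_stat_database (the query is standard PostgreSQL; the driver and DSN below are assumptions):

```python
import asyncio
import asyncpg

# blks_hit vs blks_read in pg_stat_database gives a buffer cache hit ratio.
HIT_RATIO_SQL = """
SELECT blks_hit::float / NULLIF(blks_hit + blks_read, 0)
FROM pg_stat_database
WHERE datname = current_database();
"""

async def check_cache_hit_ratio(dsn: str) -> None:
    conn = await asyncpg.connect(dsn)
    try:
        ratio = await conn.fetchval(HIT_RATIO_SQL)
        # Counters are cumulative since the last stats reset; in practice,
        # sample deltas over time. Healthy OLTP often sits above 0.99; a
        # sustained drop after agent traffic suggests hot pages were evicted.
        print(f"buffer cache hit ratio: {ratio:.4f}")
    finally:
        await conn.close()

asyncio.run(check_cache_hit_ratio("postgresql://localhost/app"))  # hypothetical DSN
```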
Naive hold vs decoupled flow
```mermaid
sequenceDiagram
    participant U as User / caller
    participant B as Backend
    participant P as Pool
    participant D as PostgreSQL
    participant L as LLM API
    U->>B: prompt / task
    B->>P: acquire connection
    P->>D: query + transaction scope
    D-->>B: rows
    Note over B,D: Naive: keep connection checked out
    B->>L: send context for reasoning (seconds)
    L-->>B: next SQL or answer
    B->>P: release connection
```
Decoupled pattern: fetch and commit (ending the connection's active use), return it to the pool immediately, then call the LLM asynchronously with payloads that do not require a live DB handle.
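A sketch of the decoupled flow, under the same assumptions as the earlier anti-pattern (asyncpg and the hypothetical `llm_*` helpers):

```python
async def answer_decoupled(pool: asyncpg.Pool, question: str) -> str:
    sql = await llm_propose_sql(question)  # no connection held while waiting

    # Borrow a slot only for the milliseconds of actual query work.
    async with pool.acquire() as conn:
        rows = await conn.fetch(sql)
    # Slot is already back in the pool before the second LLM wait starts.

    payload = [dict(r) for r in rows]      # plain data, no live DB handle
    return await llm_summarize(payload)
```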
Engineering takeaway
You cannot treat an AI agent like a standard API client.
If you build agentic commerce or autonomous backends, decouple the reasoning layer from the data layer.
- Never hold a connection: fetch the data, release the connection back to the pool immediately, then send the payload to the LLM asynchronously.
- Read replicas: force agent-generated SQL against isolated read replicas so the agent thrashes a secondary cache while the primary stays clean for production traffic (see the sketch after this list).
- Semantic layers: do not let agents issue raw SQL against core tables. Prefer restricted, pre-aggregated APIs or vector stores with clear boundaries.
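For the read-replica point, a minimal routing sketch. The DSNs are hypothetical, and the statement timeout is one way to bound runaway generated scans:

```python
import asyncpg

async def make_pools() -> tuple[asyncpg.Pool, asyncpg.Pool]:
    # Hypothetical DSNs: production traffic gets the primary,
    # agent-generated SQL gets an isolated read replica.
    primary = await asyncpg.create_pool("postgresql://primary.internal/app")
    replica = await asyncpg.create_pool("postgresql://replica.internal/app")
    return primary, replica

async def run_agent_sql(replica: asyncpg.Pool, sql: str):
    async with replica.acquire() as conn:
        # Cap runaway scans so one bad generated query cannot run for minutes.
        await conn.execute("SET statement_timeout = '5s'")
        return await conn.fetch(sql)
```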
Software scaling is about understanding where time is spent. Do not let external network latency on inference dictate internal database throughput.