# Database Connection Pooling vs Direct Connections

## Context

Prism's backend plugins need to connect to data stores (PostgreSQL, Redis, ClickHouse, etc.). Each plugin must decide: use connection pooling or direct connections per request?

## The Tradeoff

**Connection Pooling:**

- Pre-established connections reused across requests
- Lower latency (no TCP handshake + auth per request)
- Bounded resource usage (pool size caps the connection count)
- Complexity: pool management, health checks, stale connection handling

**Direct Connections:**

- New connection per request
- Higher latency (TCP + TLS + auth overhead: ~5-50ms)
- Unbounded resource usage (connections scale with request rate)
- Simplicity: no pool management needed

## Why This Matters at Scale

From Netflix's experience at 8M QPS:

- Connection churn kills performance at scale
- PostgreSQL's `max_connections` is typically 100-200 (too low for high concurrency)
- Redis benefits from persistent connections (pipelining, reduced latency)
- But: connection pools can become bottlenecks if undersized

## Decision

Use connection pooling by default for all backends, with backend-specific tuning:

### Connection Pool Strategy Matrix

| Backend | Pool Type | Pool Size Formula | Rationale |
|---|---|---|---|
| PostgreSQL | Shared pool per namespace | `max(10, RPS / 100)` | Expensive connections, limited `max_connections` |
| Redis | Shared pool per namespace | `max(5, RPS / 1000)` | Cheap connections, benefits from pipelining |
| Kafka | Producer pool | `max(3, num_partitions / 10)` | Producers are heavyweight, batching preferred |
| ClickHouse | Shared pool per namespace | `max(5, RPS / 200)` | Query-heavy, benefits from persistent HTTP/2 |
| NATS | Single persistent connection | 1 per namespace | Multiplexing over a single connection |
| Object Storage (S3/MinIO) | No pool (HTTP client reuse) | N/A | HTTP client handles pooling internally |
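
The matrix can be encoded directly in the proxy. A minimal sketch (the `Backend` enum and its inputs are illustrative, not Prism's actual types):

```rust
/// Sketch: pool sizing per the strategy matrix above.
/// The enum and inputs are illustrative, not Prism's actual types.
pub enum Backend {
    Postgres,
    Redis,
    Kafka { num_partitions: u32 },
    ClickHouse,
    Nats,
    ObjectStorage,
}

/// Returns `None` where no explicit pool is managed (object storage).
pub fn pool_size(backend: &Backend, rps: u32) -> Option<u32> {
    match backend {
        Backend::Postgres => Some((rps / 100).max(10)),
        Backend::Redis => Some((rps / 1000).max(5)),
        Backend::Kafka { num_partitions } => Some((num_partitions / 10).max(3)),
        Backend::ClickHouse => Some((rps / 200).max(5)),
        Backend::Nats => Some(1),       // single multiplexed connection
        Backend::ObjectStorage => None, // HTTP client pools internally
    }
}
```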

### Pool Configuration

```yaml
# Per-backend pool settings
backends:
  postgres:
    pool:
      min_size: 10
      max_size: 100
      idle_timeout: 300s          # Close idle connections after 5 min
      max_lifetime: 1800s         # Recycle connections after 30 min
      connection_timeout: 5s
      health_check_interval: 30s

  redis:
    pool:
      min_size: 5
      max_size: 50
      idle_timeout: 600s          # Redis connections are cheap to keep alive
      max_lifetime: 3600s
      connection_timeout: 2s
      health_check_interval: 60s
```
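
If the proxy deserializes these settings, the corresponding Rust types might look like the sketch below. Field names mirror the YAML above; parsing the `300s`-style duration strings with the `humantime_serde` crate is an assumption, not a Prism requirement:

```rust
use std::collections::HashMap;
use std::time::Duration;

use serde::Deserialize;

/// Top-level `backends:` map, keyed by backend name ("postgres", "redis", ...)
#[derive(Debug, Deserialize)]
pub struct BackendsConfig {
    pub backends: HashMap<String, BackendConfig>,
}

#[derive(Debug, Deserialize)]
pub struct BackendConfig {
    pub pool: PoolSettings,
}

/// Mirrors the per-backend `pool:` block above.
#[derive(Debug, Deserialize)]
pub struct PoolSettings {
    pub min_size: usize,
    pub max_size: usize,
    #[serde(with = "humantime_serde")]
    pub idle_timeout: Duration,
    #[serde(with = "humantime_serde")]
    pub max_lifetime: Duration,
    #[serde(with = "humantime_serde")]
    pub connection_timeout: Duration,
    #[serde(with = "humantime_serde")]
    pub health_check_interval: Duration,
}
```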

## Rationale

### PostgreSQL: Pool Essentials

**Why pool?**

- PostgreSQL connection cost: ~10-20ms (TCP + TLS + auth)
- At 1000 RPS, that is 10-20 seconds of connection-setup time incurred per second of traffic (unsustainable)
- PostgreSQL's `max_connections` limit (often 100-200) is too low for per-request direct connections

**Sizing:**

- Rule of thumb: `pool_size = (total_requests_per_second * avg_query_duration) / num_proxy_instances`
- Example: 5000 RPS * 0.005s avg query / 5 instances = 5 connections per instance
- Add buffer for spikes: 5 * 2 = 10 connections (see the sketch below)
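
The same arithmetic as a tiny sketch (the function name, 2x buffer, and floor of 10 from the matrix are illustrative):

```rust
/// Rule-of-thumb sizing from above; inputs are illustrative.
fn postgres_pool_size(rps: f64, avg_query_secs: f64, instances: f64) -> u32 {
    let per_instance = rps * avg_query_secs / instances; // 5000 * 0.005 / 5 = 5
    ((per_instance * 2.0).ceil() as u32).max(10)         // 2x spike buffer → 10
}
```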

**Gotchas:**

- Postgres transaction state: ensure proper BEGIN/COMMIT handling
- Connection reuse: always ROLLBACK on error so the next checkout gets a clean session (sketched below)
- Prepared statements: cache per connection for efficiency
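
A sketch of the ROLLBACK-on-error pattern, assuming deadpool-postgres (the helper and its signature are illustrative):

```rust
use deadpool_postgres::Pool;
use tokio_postgres::types::ToSql;

/// Run one statement in a transaction; ROLLBACK on error so the next
/// checkout never inherits an aborted transaction. Illustrative helper.
pub async fn execute_in_tx(
    pool: &Pool,
    sql: &str,
    params: &[&(dyn ToSql + Sync)],
) -> Result<u64, Box<dyn std::error::Error>> {
    let conn = pool.get().await?;
    conn.batch_execute("BEGIN").await?;
    match conn.execute(sql, params).await {
        Ok(rows) => {
            conn.batch_execute("COMMIT").await?;
            Ok(rows)
        }
        Err(e) => {
            // Best-effort ROLLBACK to restore a clean session state
            let _ = conn.batch_execute("ROLLBACK").await;
            Err(e.into())
        }
    }
}
```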

### Redis: Pool for Pipelining

**Why pool?**

- Redis connection cost: ~1-2ms (cheap, but it adds up)
- Pipelining: batch multiple commands over a single connection
- With pipelining, one connection can sustain on the order of 100K commands/sec, so even 10K RPS needs only a handful of connections (see the sketch below)
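
A pipelining sketch using the `redis` crate with its tokio support (the connection string and keys are illustrative):

```rust
async fn pipelined_example() -> redis::RedisResult<()> {
    // One multiplexed async connection; illustrative local address
    let client = redis::Client::open("redis://127.0.0.1/")?;
    let mut conn = client.get_multiplexed_async_connection().await?;

    // Several commands batched into a single round trip
    let (hits, title): (i64, String) = redis::pipe()
        .incr("page:home:hits", 1)
        .get("page:home:title")
        .query_async(&mut conn)
        .await?;

    println!("hits={hits} title={title}");
    Ok(())
}
```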

**Sizing:**

- Much smaller pools than PostgreSQL (Redis is single-threaded per instance)
- More connections don't help unless you shard across Redis instances
- Example: 50K RPS → 5-10 connections is sufficient

### Kafka: Producer Pooling

**Why pool producers?**

- KafkaProducer is heavyweight (metadata fetching, batching logic)
- Creating one per request is extremely inefficient
- One producer can handle 10K+ messages/sec

**Sizing:**

- Typically a small pool shared across all partitions: `max(3, num_partitions / 10)`, per the matrix above
- Example: 10 partitions → 3 producers (requests round-robin across them; see the sketch below)

**Key insight**: Kafka producers batch internally, so pooling amplifies efficiency.
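
A sketch of a small producer pool with round-robin dispatch, using the rdkafka crate (broker address, settings, and error handling are illustrative):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::time::Duration;

use rdkafka::config::ClientConfig;
use rdkafka::producer::{FutureProducer, FutureRecord};

pub struct ProducerPool {
    producers: Vec<FutureProducer>,
    next: AtomicUsize,
}

impl ProducerPool {
    pub fn new(brokers: &str, size: usize) -> Result<Self, rdkafka::error::KafkaError> {
        let producers = (0..size)
            .map(|_| {
                ClientConfig::new()
                    .set("bootstrap.servers", brokers)
                    .set("linger.ms", "5") // small delay so batching can work
                    .create()
            })
            .collect::<Result<Vec<_>, _>>()?;
        Ok(Self { producers, next: AtomicUsize::new(0) })
    }

    pub async fn send(&self, topic: &str, key: &str, payload: &[u8]) {
        // Round-robin across the pooled producers
        let i = self.next.fetch_add(1, Ordering::Relaxed) % self.producers.len();
        let record = FutureRecord::to(topic).key(key).payload(payload);
        if let Err((e, _msg)) = self.producers[i].send(record, Duration::from_secs(0)).await {
            eprintln!("kafka send failed: {e}");
        }
    }
}
```

The atomic counter keeps dispatch lock-free; with rdkafka's internal batching, even a pool of three producers comfortably covers the 10K+ messages/sec cited above.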

### NATS: Single Connection

**Why not pool?**

- The NATS protocol multiplexes requests over a single connection
- Creating multiple connections adds no benefit (and wastes resources)
- NATS client libraries handle this internally

**Configuration:**

```rust
// Single NATS connection per namespace; all requests multiplex over it
// (using the async-nats client, whose `connect` is async)
let nats_client = async_nats::connect(config.connection_string.as_str()).await?;
```

### Object Storage: Client-Level Pooling

**Why not an explicit pool?**

- HTTP clients (reqwest, hyper) handle connection pooling internally
- The S3 API is stateless, with no transaction semantics
- The client library's default pooling is usually optimal

**Configuration:**

```rust
// HTTP client with a built-in connection pool
let s3_client = aws_sdk_s3::Client::from_conf(
    aws_sdk_s3::config::Builder::new()
        .http_client(
            aws_smithy_runtime::client::http::hyper_014::HyperClientBuilder::new()
                .build_https(), // uses hyper's connection pool
        )
        .build(),
);
```

## Alternatives Considered

### 1. Direct Connections (No Pooling)

```rust
// New connection per request (shown for contrast; types are illustrative)
pub async fn execute(&self, req: ExecuteRequest) -> Result<ExecuteResponse> {
    let conn = PostgresConnection::connect(&self.config).await?; // 10-20ms overhead!
    let result = conn.query(&req.query).await?;
    Ok(result)
}
```

- **Pros**: Simple, no pool management, no connection-reuse bugs
- **Cons**: Terrible performance (10-20ms overhead per request), exhausts `max_connections`
- **Rejected because**: Unsustainable at any meaningful scale

### 2. Thread-Local Connections

- **Pros**: No contention, one connection per thread
- **Cons**: Doesn't work with async (threads != tasks), wastes connections
- **Rejected because**: Incompatible with the tokio async model

### 3. Per-Request Connection with Caching

- **Pros**: Automatic pooling via an LRU cache
- **Cons**: Complex TTL management, unclear ownership, health-check challenges
- **Rejected because**: Reinventing connection pooling, poorly

## Consequences

### Positive

- **10-100x latency improvement** vs direct connections (no TCP handshake per request)
- **Resource efficiency**: A fixed connection count prevents database overload
- **Predictable performance**: Pool size controls max concurrency
- **Backend protection**: Prevents a stampeding herd from overwhelming databases

### Negative

- **Stale connections**: Need health checks and recycling
- **Pool exhaustion**: If the pool is too small, requests queue (still better than overwhelming the DB)
- **Complex configuration**: Pool sizes must be tuned per workload
- **Connection state**: Must ensure clean state between reuses (transactions, temp tables)

### Neutral

- **Warm-up time**: Pools need to fill on startup (min_size connections created)
- **Monitoring**: Need metrics on pool utilization, wait times, health-check failures
- **Graceful shutdown**: Must drain pools cleanly on shutdown (see the sketch below)
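
A minimal drain-on-shutdown sketch, assuming deadpool's `Pool::close` and polling `status()` until connections are returned (the bounded loop is a simplification):

```rust
use std::time::Duration;

use deadpool_postgres::Pool;

pub async fn shutdown_pool(pool: &Pool) {
    // Stop handing out connections; pending `get()` calls will error
    pool.close();

    // Give in-flight requests a bounded window to return their connections
    for _ in 0..50 {
        let status = pool.status();
        if status.size == status.available {
            break; // every connection is back in the pool
        }
        tokio::time::sleep(Duration::from_millis(100)).await;
    }
}
```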

## Implementation Notes

### PostgreSQL Pool Implementation (Using deadpool-postgres)

```rust
use std::time::Duration;

use deadpool_postgres::{Config, Pool, Runtime};
use tokio_postgres::NoTls;

pub struct PostgresPlugin {
    pool: Pool,
}

impl PostgresPlugin {
    pub async fn new(config: PostgresConfig) -> Result<Self> {
        let mut cfg = Config::new();
        cfg.url = Some(config.connection_string);
        cfg.pool = Some(deadpool::managed::PoolConfig {
            max_size: config.pool_max_size,
            timeouts: deadpool::managed::Timeouts {
                wait: Some(Duration::from_secs(5)),
                create: Some(Duration::from_secs(5)),
                recycle: Some(Duration::from_secs(1)),
            },
            ..Default::default()
        });

        // `create_pool` requires a TLS connector; NoTls for plaintext connections
        let pool = cfg.create_pool(Some(Runtime::Tokio1), NoTls)?;

        Ok(Self { pool })
    }

    pub async fn execute(&self, req: ExecuteRequest) -> Result<ExecuteResponse> {
        // Get a connection from the pool (waits if the pool is exhausted)
        let conn = self.pool.get().await?;

        // Execute the query
        let rows = conn.query(&req.query, &req.params).await?;

        // The connection is returned to the pool automatically when dropped
        Ok(ExecuteResponse::from_rows(rows))
    }
}
```

### Health Checks

```rust
use std::time::Duration;

use deadpool_postgres::Pool;
use tracing::{error, warn};

// Periodic health check surfaces stale connections
async fn health_check_loop(pool: Pool) {
    let mut interval = tokio::time::interval(Duration::from_secs(30));

    loop {
        interval.tick().await;

        // Test a connection
        match pool.get().await {
            Ok(conn) => {
                if let Err(e) = conn.simple_query("SELECT 1").await {
                    warn!("Pool health check failed: {}", e);
                    // The pool recreates the connection on the next checkout
                }
            }
            Err(e) => {
                error!("Failed to get connection for health check: {}", e);
            }
        }
    }
}
```

### Pool Metrics

```rust
use deadpool_postgres::Pool;

// Expose pool metrics to Prometheus via the `metrics` crate facade
pub fn record_pool_metrics(pool: &Pool, namespace: &str, backend: &str) {
    let status = pool.status();

    metrics::gauge!("prism_pool_size", status.size as f64,
        "namespace" => namespace.to_string(), "backend" => backend.to_string());

    metrics::gauge!("prism_pool_available", status.available as f64,
        "namespace" => namespace.to_string(), "backend" => backend.to_string());

    metrics::gauge!("prism_pool_waiting", status.waiting as f64,
        "namespace" => namespace.to_string(), "backend" => backend.to_string());
}
```

### Configuration Tuning Guidance

**Start with conservative sizes**: `pool_size = max(min_size, expected_p99_rps * p99_query_latency_seconds)`

**Example**:
- Expected P99 RPS: 1000
- P99 query latency: 50ms = 0.05s
- Pool size: max(10, 1000 * 0.05) = 50 connections

**Monitor and adjust** (a resize sketch follows the list):
- If `pool_waiting` metric > 0: pool too small, increase size
- If `pool_available` always ~= `pool_size`: pool too large, decrease size
- If connection errors spike: check database `max_connections` limit
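
One way to act on those signals, assuming deadpool's `Pool::resize` (thresholds, step sizes, and bounds are illustrative):

```rust
use deadpool_postgres::Pool;

/// Illustrative, metric-driven resize; the bounds are arbitrary.
pub fn adjust_pool_size(pool: &Pool) {
    let status = pool.status();

    if status.waiting > 0 {
        // Requests are queueing: grow, bounded to protect the database
        pool.resize((status.max_size + 5).min(200));
    } else if status.available == status.size && status.max_size > 10 {
        // Pool fully idle: shrink toward the floor
        pool.resize(status.max_size - 5);
    }
}
```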

## References

- [PostgreSQL Connection Pooling Best Practices](https://www.postgresql.org/docs/current/runtime-config-connection.html)
- [Redis Pipelining](https://redis.io/docs/manual/pipelining/)
- [HikariCP (Java) Connection Pool Sizing](https://github.com/brettwooldridge/HikariCP/wiki/About-Pool-Sizing)
- RFC-008: Proxy Plugin Architecture (where pools live)
- [deadpool-postgres Documentation](https://docs.rs/deadpool-postgres/)

## Revision History

- 2025-10-08: Initial draft with backend-specific pooling strategies