MEMO-043: Gap Analysis for RFC-047 and RFC-048

Executive Summary

This memo analyzes RFC-047 (Namespace Reservation with Lease Management) and RFC-048 (Cross-Proxy Partition Strategies) against Prism's core principles of simplicity, reliability, robustness, comprehensibility, and configurability. The analysis identifies critical gaps, potential conflicts with existing ADRs, and missing operational patterns that must be addressed before implementation.

Key Findings:

  1. Missing Namespace Advertisement Protocol: No protobuf definition for how admin pushes namespace assignments (including proxy list, runners, partition strategy) to proxies
  2. Admin HA Architecture Underspecified: Raft cluster design for prism-admin not detailed; proxy discovery mechanism not defined
  3. Load Balancer Integration Missing: No guidance on upstream load balancer configuration for session affinity and consistent hashing
  4. Operational Complexity: JWT lease management combined with partition strategies creates significant operational burden
  5. Testing Gap: No strategy for testing 3-node admin Raft + 3-proxy coordination locally (conflicts with ADR-004)
  6. Missing Observability: Comprehensive metrics, traces, and debugging tools not specified
  7. Configuration Explosion: 20+ tunable parameters without production-ready defaults or guardrails

Recommendation: Address the critical gaps before proceeding with implementation. The performance impact of request forwarding is acceptable when mitigated with session-based load balancing at the upstream load balancer.

Core Principles Review

From ADRs

ADR-001: Rust for Proxy

  • Target: P50 <0.3ms, P99 <2ms, 200k RPS
  • No GC pauses, predictable performance
  • Resource efficiency critical

ADR-002: Client-Originated Configuration

  • Self-service with authorization boundaries
  • Clear permission levels (Guided, Advanced, Expert)
  • Fast iteration without infrastructure team bottleneck

ADR-003: Protobuf as Single Source of Truth

  • DRY principle
  • Type safety across all components
  • Generated code for consistency

ADR-004: Local-First Testing

  • Real backends, not mocks
  • Full test suite runs locally in <1 minute
  • Same tests in CI and development

ADR-006: Namespace and Multi-Tenancy

  • Namespaces as isolation boundary
  • Sharded deployments for fault isolation
  • Strong isolation over resource efficiency

ADR-055: Proxy-Admin Control Plane

  • Bidirectional gRPC protocol
  • 256 partitions with consistent hashing
  • Heartbeat every 30s
  • Graceful degradation if admin unavailable

RFC-047 Key Features

  • JWT-based namespace authorization
  • Lease lifecycle with TTL (default 24h)
  • Grace period before expiration (1h)
  • Standalone vs coordinated modes
  • Admin plane stores namespace registry in SQLite
  • Background cleanup job for expired leases

RFC-048 Key Features

  • Three partition strategies: consistent hashing, key range, explicit mapping
  • Request forwarding: transparent (default), redirect, client-side routing
  • 256 partitions fixed
  • Rebalancing protocol for topology changes
  • Partition table cached in each proxy

1. Simplicity Analysis

Concerns

1.1 Dual Operating Modes Create Code Complexity

RFC-047 defines standalone and coordinated modes with different behaviors:

// RFC-047 lines 726-862
pub enum OperatingMode {
    Standalone,  // No admin plane
    Coordinated, // Admin plane coordination
}

Problem: Every namespace operation must check mode and branch logic:

  • Standalone: local registry, self-signed JWT, no expiration
  • Coordinated: admin RPC, RSA-signed JWT, lease expiration

Impact: Doubles code paths, testing surface, and operational complexity.

Recommendation: Start with coordinated mode only. Add standalone mode later if strong use case emerges.

1.2 Three Partition Strategies

RFC-048 supports consistent hashing, key range assignment, and explicit bucket mapping.

Analysis: All three strategies have valid use cases and should be kept.

Use Cases:

  1. Consistent Hashing (Default):

    • General-purpose workload distribution
    • Best for 10+ proxies with dynamic scaling
    • Minimal rebalancing on topology changes
  2. Key Range Assignment:

    • Multi-tenant SaaS (all "acme-*" namespaces on same proxy)
    • Geographic/compliance boundaries (all "eu-*" namespaces in EU region)
    • Easier capacity planning per customer or region
  3. Explicit Bucket Mapping:

    • Resource isolation for critical workloads (prod-* on dedicated proxies)
    • Manual capacity management
    • Specialized hardware requirements

Problem: Operators must choose strategy without clear guidance.

Recommendation:

  • Default: Consistent hashing for all deployments
  • Require ADR-002 "Advanced" Permission: Key range and explicit strategies
  • Decision Tree: Provide flowchart for strategy selection
  • Migration Path: Support strategy changes with data migration tool

1.3 JWT Token Lifecycle Management

RFC-047 requires clients to:

  1. Store JWT tokens securely
  2. Track expiration times
  3. Implement refresh logic (50% of TTL)
  4. Handle refresh failures
  5. Deal with grace period warnings

Problem: Pushes significant complexity to every client library.

Impact:

  • Client SDKs more complex
  • Risk of token expiration bugs in production
  • Difficult to debug token-related issues

Recommendation:

  • Simplify: Issue long-lived tokens (30 days) with explicit revocation
  • Add token refresh to the proxy (transparent to clients; a sketch follows this list)
  • Provide client SDK with automatic token management
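
A minimal sketch of the proxy-side refresh in recommendation 2, assuming a hypothetical AdminClient trait with a refresh_lease RPC (neither is RFC-047 API); the 50% threshold mirrors the refresh logic RFC-047 currently pushes onto every client:

use std::time::{Duration, Instant};

// `AdminClient` is a stand-in for the proxy's admin-plane client (assumed).
trait AdminClient {
    async fn refresh_lease(&self, jwt: &str) -> Result<LeaseToken, String>;
}

struct LeaseToken {
    jwt: String,
    issued_at: Instant,
    ttl: Duration,
}

impl LeaseToken {
    // Refresh once 50% of the TTL has elapsed, so clients never observe expiry.
    fn needs_refresh(&self) -> bool {
        self.issued_at.elapsed() >= self.ttl / 2
    }
}

async fn refresh_loop(admin: impl AdminClient, mut token: LeaseToken) {
    loop {
        tokio::time::sleep(Duration::from_secs(60)).await; // check interval
        if token.needs_refresh() {
            match admin.refresh_lease(&token.jwt).await {
                Ok(new) => token = new,
                Err(e) => eprintln!("lease refresh failed, will retry: {e}"),
            }
        }
    }
}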

1.4 Configuration Complexity

Count of configuration parameters:

RFC-047:

  • Lease: TTL, grace period, cleanup interval, min/max TTL (5 params)
  • JWT: algorithm, private key, public key, issuer (4 params)
  • Mode: standalone vs coordinated (1 param)

RFC-048:

  • Partition: count, hash function, rebalance threshold, move interval (4 params)
  • Forwarding: mode, timeout, max hops, connection pool settings (6 params)
  • Strategy: type, ranges, bucket mappings (3 params)

Total: 23+ tunable parameters

Problem: No production-ready defaults provided.

Recommendation:

  • Define "golden path" configuration with safe defaults
  • Mark advanced settings as "expert only"
  • Provide configuration profiles: development, staging, production

Simplicity Score: 3/10

Rationale: Too many modes, strategies, and parameters without clear guidance.

2. Reliability Analysis

Concerns

2.1 Admin Plane High Availability Architecture

RFC-047/048 don't specify admin plane HA design, creating availability risk.

Required Architecture:

┌───────────────────────────────────────────┐
│   prism-admin (3-node Raft cluster)       │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐ │
│  │  Leader  │  │ Follower │  │ Follower │ │
│  │ (writes) │  │ (reads)  │  │ (reads)  │ │
│  └────┬─────┘  └──────────┘  └──────────┘ │
└───────┼───────────────────────────────────┘
        │ NamespaceAdvertisement:
        │ {
        │   namespace: "orders-prod",
        │   partition_id: 42,
        │   proxies: ["proxy-a:8980", "proxy-b:8980"],
        │   runners: [...],
        │   config: {...},
        │   partition_strategy: "consistent_hash"
        │ }
        ▼
┌───────────────────────────────────────────┐
│ Proxy-A (discovers Proxy-B from msg)      │
│ - Caches namespace → proxy mapping        │
│ - Forwards requests to Proxy-B            │
│ - Does NOT run Raft                       │
└───────────────────────────────────────────┘

Failure Scenarios:

Scenario                 | Mitigation                        | Status
-------------------------|-----------------------------------|------------------------
Admin leader down        | Raft elects new leader (<5s)      | ✅ Handled by Raft
Admin split-brain        | Raft consensus prevents           | ✅ Handled by Raft
All admin nodes down     | Proxy uses last-known-good config | ⚠️ Needs specification
Admin storage corruption | Raft log replication prevents     | ✅ Handled by Raft

Critical Gaps:

  1. No NamespaceAdvertisement Protobuf: Message structure not defined
  2. Proxy Discovery Mechanism: How proxy learns about other proxies not specified
  3. Admin Endpoint List: Proxy should connect to all 3 admin nodes, not just one
  4. Raft Implementation: Which Raft library (hashicorp/raft, etcd/raft)?

Recommendations:

  1. Define NamespaceAdvertisement Message: Add to proto/prism/admin/v1/namespace.proto (a possible shape is sketched after this list)
  2. Admin Endpoint List: Proxy config takes comma-separated list: admin-1:8981,admin-2:8982,admin-3:8983
  3. Fallback Mode: Proxy operates with last-known-good namespace configuration (specify caching strategy)
  4. Circuit Breaker: Proxy stops calling admin after N failures, uses local config
  5. Raft Library: Recommend hashicorp/raft for maturity and Go integration
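
To make gap 1 concrete, here is one possible shape for the advertisement, mirroring the fields in the diagram above and shown as the prost-generated Rust type (per ADR-003 the .proto file would be the source of truth; field names, tags, and the generation field are illustrative assumptions, not a ratified schema):

#[derive(Clone, PartialEq, prost::Message)]
pub struct NamespaceAdvertisement {
    #[prost(string, tag = "1")]
    pub namespace: String,            // e.g. "orders-prod"
    #[prost(uint32, tag = "2")]
    pub partition_id: u32,            // e.g. 42
    #[prost(string, repeated, tag = "3")]
    pub proxies: Vec<String>,         // e.g. ["proxy-a:8980", "proxy-b:8980"]
    #[prost(string, repeated, tag = "4")]
    pub runners: Vec<String>,
    #[prost(string, tag = "5")]
    pub partition_strategy: String,   // "consistent_hash" | "key_range" | "explicit"
    #[prost(uint64, tag = "6")]
    pub generation: u64,              // fencing token; see Section 2.4
}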

2.2 JWT Token Expiration During Operations

RFC-047 line 1072: "Tokens become invalid after expiration"

Problem: What happens if token expires mid-operation?

Example:

  1. Client starts long-running transaction at t=0
  2. Token expires at t=24h
  3. Transaction tries to commit at t=24h + 1min
  4. Auth failure → transaction aborted

Impact: Data loss, orphaned resources, inconsistent state.

Recommendations:

  1. Token validation at request start, not during execution (sketched after this list)
  2. Grace period for in-flight operations (5 minute extension)
  3. Proxy pre-emptively refreshes tokens before expiration
  4. Client SDK auto-retries with refreshed token
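
A sketch combining recommendations 1 and 2: expiry is enforced strictly when a request is admitted, while checks made mid-operation get a bounded grace window, so the t=24h + 1min commit in the example above would succeed. The function shape is an assumption:

use std::time::{SystemTime, UNIX_EPOCH};

const IN_FLIGHT_GRACE_SECS: u64 = 300; // recommendation 2: 5 minute extension

enum AuthError {
    TokenExpired,
}

fn check_expiry(token_exp_unix: u64, at_request_start: bool) -> Result<(), AuthError> {
    let now = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_secs();
    // Strict at admission; lenient for work already in flight.
    let deadline = if at_request_start {
        token_exp_unix
    } else {
        token_exp_unix + IN_FLIGHT_GRACE_SECS
    };
    if now < deadline {
        Ok(())
    } else {
        Err(AuthError::TokenExpired)
    }
}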

2.3 Partition Rebalancing Disruption

RFC-048 lines 845-876 describe rebalancing protocol but don't address:

Questions:

  • What happens to in-flight requests during partition move?
  • How long does a partition move take?
  • Can clients still send requests during rebalancing?
  • What if new proxy fails during partition handoff?

Recommendations:

  1. Two-Phase Commit: Old proxy drains, new proxy prepares, then cutover (sketched after this list)
  2. Request Buffering: Queue requests during transition (<100ms)
  3. Rollback Protocol: Revert partition move if new proxy fails
  4. Rebalancing Timeouts: Abort move after 30 seconds, restore old assignment
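
The two-phase cutover in recommendation 1 could be modeled as an explicit state machine; a sketch with assumed state names (not RFC-048 normative protocol):

use std::time::Instant;

enum PartitionMove {
    // Old proxy stops accepting new requests and finishes in-flight ones.
    Draining { partition: u32, deadline: Instant },
    // New proxy has loaded config and runners and acknowledged readiness.
    Prepared { partition: u32 },
    // Admin flips the partition table; requests buffered <100ms are replayed.
    CutOver { partition: u32, table_version: u64 },
    // New proxy failed or the 30-second timeout elapsed: restore old assignment.
    RolledBack { partition: u32, reason: String },
}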

2.4 Split-Brain Scenarios

Scenario: Proxy loses connection to admin but continues serving traffic.

RFC-047 line 516: "Audit log all reservation operations"

Problem:

  1. Proxy A disconnected from admin, serves namespace X with stale config
  2. Admin reassigns namespace X to Proxy B
  3. Both proxies serve namespace X with different configs

Impact: Data corruption, duplicate writes, inconsistent reads.

Recommendations:

  1. Fencing Tokens: Admin issues generation number, proxy rejects requests with old generation (sketched after this list)
  2. Lease Heartbeat: Namespace becomes read-only if proxy misses 2 heartbeats to admin
  3. Partition Ownership Verification: Proxy validates partition ownership every 30s
  4. Admin Authority: Proxy trusts admin's view, not local cache
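
Recommendation 1 is the classic fencing-token pattern: reject any request carrying a generation older than the highest one seen. A minimal sketch (names are assumptions):

struct Fence {
    highest_seen: u64, // generation numbers issued by the admin
}

impl Fence {
    fn admit(&mut self, generation: u64) -> bool {
        if generation < self.highest_seen {
            // A stale owner (e.g. a proxy cut off from the admin) is rejected,
            // preventing split-brain writes.
            return false;
        }
        self.highest_seen = generation;
        true
    }
}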

2.5 Background Cleanup Job Failures

RFC-047 lines 684-721: Background job deletes expired namespaces.

Problem:

  • What if cleanup job crashes?
  • What if deletion fails (backend unavailable)?
  • What if namespace still has active connections?

Recommendations:

  1. Soft Delete: Mark namespace as deleted, actually delete after 7 days
  2. Delete Retry: Exponential backoff, alert after 3 failures
  3. Active Connection Check: Don't delete if sessions exist
  4. Manual Override: Admin can force delete or restore namespace

Reliability Score: 4/10

Rationale: Critical failure modes underspecified, no HA strategy for admin.

3. Robustness Analysis

Concerns

3.1 Request Forwarding with Upstream Load Balancer Integration

RFC-048 line 511: "Extra network hop for non-local requests, increased latency (~1-2ms)"

Analysis: Request forwarding latency is acceptable when upstream load balancer provides session affinity.

With Random Load Balancing (No Affinity):

  • Probability request lands on correct proxy: 25% (4 proxies, 256 partitions)
  • 75% of requests incur forwarding penalty (~1.5ms)
  • P50 latency: 0.25 × 0.3ms + 0.75 × (0.3ms + 1.5ms) ≈ 1.4ms

With Session-Based Load Balancing (Namespace Affinity):

  • Load balancer hashes on the X-Prism-Namespace header → always routes to the same proxy
  • 100% of requests land on correct proxy (no forwarding)
  • P50 latency: 0.3ms (ADR-001 target met)
  • Forwarding only used for failover or rebalancing

Upstream Load Balancer Configuration (HAProxy example):

backend prism_proxies
    balance uri depth 1   # Hash on first path segment (namespace)
    hash-type consistent  # Consistent hashing

    # Extract namespace from path or header
    http-request set-var(txn.namespace) path,field(2,/)

    server proxy-a proxy-a:8980 check
    server proxy-b proxy-b:8980 check
    server proxy-c proxy-c:8980 check

Critical Gap: RFC-048 doesn't specify load balancer integration or header standards.

Recommendations:

  1. Standardize Headers: Clients MUST send X-Prism-Namespace header on all requests
  2. Load Balancer Guide: Provide HAProxy, Envoy, and NGINX configurations
  3. Fallback Forwarding: Proxies still support forwarding for failover scenarios
  4. Benchmark: Measure actual forwarding overhead (validate 1-2ms assumption)
  5. Operational Mode: Document load balancer affinity as recommended production setup

3.2 Partition Table Consistency

RFC-048 lines 682-691: Partition table cached in each proxy, updated via admin push.

Problem: What if partition table update fails on subset of proxies?

Scenario:

  1. Admin updates partition table: namespace X moves from Proxy A → Proxy B
  2. Update succeeds on Proxy A, fails on Proxy C
  3. Client connects to Proxy C, which still routes to Proxy A
  4. Proxy A rejects request (no longer owns partition)

Impact: Request failures, client retries, increased latency.

Recommendations:

  1. Versioned Partition Table: Include version number, proxies reject requests if stale (sketched after this list)
  2. Admin Push Retry: Admin retries failed updates with exponential backoff
  3. Proxy Pull: Proxies poll admin every 30s for latest partition table
  4. Graceful Redirect: Old proxy returns "moved to Proxy B" error with new address
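
A sketch of recommendation 1 with assumed types: routing fails fast when the local table is older than the version a request or admin push claims, rather than sending clients to a proxy that no longer owns the partition:

use std::collections::HashMap;

enum RouteError {
    StaleTable { have: u64, need: u64 },
    Unassigned(u32),
}

struct PartitionTable {
    version: u64,
    owner: HashMap<u32, String>, // partition_id -> proxy address
}

fn route(table: &PartitionTable, required_version: u64, partition: u32) -> Result<&str, RouteError> {
    if table.version < required_version {
        // Caller should pull a fresh table from the admin (or wait for the push).
        return Err(RouteError::StaleTable { have: table.version, need: required_version });
    }
    table.owner
        .get(&partition)
        .map(String::as_str)
        .ok_or(RouteError::Unassigned(partition))
}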

3.3 JWT Revocation Delay

RFC-047 line 1076: "Lease table tracks active tokens for revocation"

Problem: JWT is stateless, proxy validates with public key only. How does revocation work?

Current Approach: Admin updates lease table, but proxy still accepts JWT until expiration.

Impact: Revoked tokens remain valid for up to 24 hours.

Recommendations:

  1. Short-Lived Tokens: 1 hour TTL, auto-refresh in proxy
  2. Revocation List: Proxy checks admin revocation list on every request (cached with 1 min TTL; sketched after this list)
  3. Opaque Tokens: Replace JWT with random tokens, proxy validates with admin on every request
  4. Hybrid: JWT for auth, opaque lease_id for quick revocation
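
A sketch of recommendation 2, caching the revocation list for one minute; fetch_revoked_lease_ids is a hypothetical admin RPC, not RFC-047 API:

use std::collections::HashSet;
use std::time::{Duration, Instant};

// Hypothetical admin-plane surface (assumed).
trait AdminRevocations {
    async fn fetch_revoked_lease_ids(&self) -> Result<HashSet<String>, String>;
}

struct RevocationCache {
    revoked: HashSet<String>, // revoked lease_ids
    fetched_at: Instant,
}

impl RevocationCache {
    async fn is_revoked(&mut self, admin: &impl AdminRevocations, lease_id: &str) -> bool {
        // Refresh at most once per minute: bounds revocation delay to ~1 minute
        // instead of the full 24-hour token lifetime.
        if self.fetched_at.elapsed() > Duration::from_secs(60) {
            if let Ok(ids) = admin.fetch_revoked_lease_ids().await {
                self.revoked = ids;
                self.fetched_at = Instant::now();
            } // on fetch failure, keep serving from the stale cache
        }
        self.revoked.contains(lease_id)
    }
}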

3.4 Namespace Name Collisions

RFC-047 line 469-477: Check namespace existence with serializable transaction.

Problem: Serializable isolation may cause high contention with many concurrent reservations.

Performance Impact:

  • 100 concurrent ReserveNamespace requests
  • Each holds serializable lock
  • Throughput drops to ~10 RPS (sequential execution)

Recommendations:

  1. Optimistic Locking: Use UNIQUE constraint, retry on conflict (sketched after this list)
  2. Advisory Locks: PostgreSQL-specific advisory locks for lower overhead
  3. Pre-Allocation: Client checks availability before transaction
  4. Rate Limiting: Limit ReserveNamespace to 10 RPS per proxy
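
A sketch of recommendation 1 against the SQLite registry using the rusqlite crate; the table and column names are assumptions about the RFC-047 schema:

use rusqlite::{Connection, ErrorCode};

enum ReserveError {
    AlreadyExists,
    Db(rusqlite::Error),
}

fn reserve(conn: &Connection, namespace: &str, owner: &str) -> Result<(), ReserveError> {
    // Rely on a UNIQUE constraint on namespaces.name instead of a serializable
    // transaction: concurrent reservations don't serialize, and the loser gets
    // a cheap, retryable "already exists" error.
    match conn.execute(
        "INSERT INTO namespaces (name, owner) VALUES (?1, ?2)",
        (namespace, owner),
    ) {
        Ok(_) => Ok(()),
        Err(rusqlite::Error::SqliteFailure(e, _)) if e.code == ErrorCode::ConstraintViolation => {
            Err(ReserveError::AlreadyExists)
        }
        Err(e) => Err(ReserveError::Db(e)),
    }
}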

3.5 Partition Rebalancing Frequency

RFC-048 line 719: "Rebalancing triggered when imbalance exceeds threshold"

Problem: No guidance on:

  • How often to check for imbalance?
  • What's acceptable rebalancing frequency?
  • Cost of rebalancing (data movement, connection disruption)?

Recommendations:

  1. Cooldown Period: Minimum 5 minutes between rebalances (sketched after this list)
  2. Cost Model: Estimate rebalancing cost (affected namespaces, data size, connections)
  3. Rebalancing Windows: Only rebalance during low-traffic hours
  4. Manual Override: Require operator approval for rebalances affecting >10% of partitions
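
A sketch tying recommendations 1 and 4 together; struct and field names are assumptions, and the threshold mirrors the imbalance_threshold in the RFC-048 config:

use std::time::{Duration, Instant};

struct RebalanceGate {
    last_rebalance: Instant,
    cooldown: Duration,       // recommendation 1: at least 5 minutes
    imbalance_threshold: f64, // e.g. 0.1 from the RFC-048 config
}

impl RebalanceGate {
    fn should_rebalance(&self, imbalance_ratio: f64, affected_fraction: f64) -> bool {
        self.last_rebalance.elapsed() >= self.cooldown
            && imbalance_ratio > self.imbalance_threshold
            // Recommendation 4: larger moves need operator approval instead.
            && affected_fraction <= 0.10
    }
}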

Robustness Score: 5/10

Rationale: Performance trade-offs not quantified, consistency mechanisms underspecified.

4. Comprehensibility Analysis

Concerns

4.1 Namespace Lifecycle Unclear

Namespace lifecycle spans two RFCs:

RFC-047: Reservation → Active → Grace Period → Expired → Purged
RFC-048: Namespace → Partition ID → Proxy Assignment → Pattern Runner Startup

Problem: How do these interact? Which RFC takes precedence?

Example Questions:

  • Does namespace reservation (RFC-047) automatically assign partition (RFC-048)?
  • Can namespace exist in RFC-047 registry but not RFC-048 partition table?
  • If lease expires, does partition assignment remain?

Recommendations:

  1. Unified Lifecycle Diagram: Show combined state machine across both RFCs
  2. State Synchronization: Specify how RFC-047 and RFC-048 states sync
  3. Single RFC: Merge RFC-047 and RFC-048 into one comprehensive RFC
  4. API Clarity: ReserveNamespace returns both JWT token AND partition assignment

4.2 JWT Structure Documentation

RFC-047 lines 376-398: JWT token structure with standard + custom claims.

Problem:

  • Which claims are required vs optional?
  • What does each permission string mean?
  • How does proxy enforce permissions?

Example: "permissions": ["namespace:configure", "pattern:create"]

  • What's the permission syntax?
  • Are there more permissions not listed?
  • How to request additional permissions?

Recommendations:

  1. Permission Schema: Define all possible permissions in protobuf
  2. RBAC Documentation: Explain role-based access control model
  3. Example Scenarios: Show JWT for different user roles (admin, developer, read-only)
  4. Validation Logic: Document how proxy checks permissions

4.3 Partition Strategy Selection

RFC-048 supports 3 strategies but no decision guidance.

Operator Questions:

  • When should I use consistent hashing vs key range?
  • What are the trade-offs?
  • Can I change strategies later?
  • How do I know if my strategy is working well?

Recommendations:

  1. Decision Tree: Flowchart for strategy selection
  2. Use Case Matrix: Strategy vs use case compatibility
  3. Migration Guide: How to switch strategies safely
  4. Monitoring: Metrics to evaluate strategy effectiveness

4.4 Forwarding Mode Confusion

RFC-048 lines 494-547: Three forwarding modes.

Problem: Which mode to use? Each has different trade-offs.

Mode        | Latency              | Client Complexity | Ops Complexity
------------|----------------------|-------------------|---------------
Transparent | +1-2ms               | Low               | Medium
Redirect    | +1 RTT first request | Medium            | Low
Client-Side | Optimal              | High              | High

Recommendations:

  1. Default Mode: Specify transparent forwarding as default
  2. Migration Path: Start transparent, evolve to client-side routing
  3. Feature Flag: Allow per-namespace forwarding mode override
  4. Performance Guide: Document when to switch modes

4.5 Error Messages

Both RFCs lack error taxonomy.

Examples:

  • "Error: Namespace already exists" - Where? On which proxy? Can I use it?
  • "Error: Lease expired" - How to recover? Re-reserve namespace?
  • "Error: Partition not assigned" - Is this transient? Should I retry?

Recommendations:

  1. Error Catalog: Enumerate all error types with recovery actions
  2. Error Codes: Unique code per error type (PRISM-NS-001, etc.)
  3. Actionable Messages: Include next steps in error message
  4. Troubleshooting Guide: Common errors and solutions

Comprehensibility Score: 4/10

Rationale: Critical concepts span multiple documents, lack clear integration.

5. Configurability Analysis

Concerns

5.1 Configuration Parameter Explosion

RFC-047 Parameters:

namespace_management:
  mode: coordinated
  admin_endpoint: "..."
  jwt_secret: "..."
  refresh:
    enabled: true
    check_interval: 30m
    refresh_threshold: 0.5
    grace_period_warning: 1h

namespace_registry:
  jwt_private_key: "..."
  jwt_public_key: "..."
  jwt_algorithm: RS256
  default_lease_ttl: 24h
  max_lease_ttl: 168h
  min_lease_ttl: 1h
  grace_period: 1h
  cleanup:
    enabled: true
    interval: 1h
    delete_after_expiry: 1h

RFC-048 Parameters:

partition_management:
  strategy: consistent_hash
  consistent_hash:
    partition_count: 256
    hash_function: crc32
    rebalance_on_topology_change: true
  rebalancing:
    enabled: true
    auto_rebalance: false
    imbalance_threshold: 0.1
    min_partition_move_interval: 5m

request_forwarding:
  enabled: true
  mode: transparent
  forward_timeout: 30s
  max_forwarding_hops: 1
  proxy_connections:
    max_idle: 10
    max_open: 100
    idle_timeout: 5m

partition_table:
  refresh_interval: 30s
  cache_size: 10000

Problem: 23+ parameters, no production defaults, no validation.

Recommendations:

  1. Configuration Profiles:

    • development.yaml: Short TTLs, small pools, verbose logging
    • production.yaml: Production-ready defaults, conservative settings
    • performance.yaml: Optimized for throughput, relaxed safety
  2. Required vs Optional: Mark required fields clearly

  3. Validation: Startup validation for incompatible settings:

    # INVALID: Standalone mode with admin endpoint
    namespace_management:
      mode: standalone
      admin_endpoint: "..."  # ❌ Ignored in standalone mode
  4. Smart Defaults: Calculate derived parameters (a Rust sketch follows this list):

    # If not specified, auto-calculate:
    # refresh_threshold = 0.5
    # grace_period = 0.1 * default_lease_ttl
    # cleanup.interval = grace_period / 2
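
A sketch of the derivation in recommendation 4; the config struct is an assumption, and the formulas come from the comments above:

use std::time::Duration;

struct LeaseDefaults {
    default_lease_ttl: Duration,
    refresh_threshold: f64,
    grace_period: Duration,
    cleanup_interval: Duration,
}

impl LeaseDefaults {
    fn derive(default_lease_ttl: Duration) -> Self {
        let grace_period = default_lease_ttl.mul_f64(0.1); // 0.1 * TTL
        LeaseDefaults {
            default_lease_ttl,
            refresh_threshold: 0.5,
            grace_period,
            cleanup_interval: grace_period / 2,
        }
    }
}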

5.2 Runtime Configuration Changes

Problem: Which settings can change without restart?

Current State: Unclear which parameters are hot-reloadable.

Recommendations:

  1. Hot Reload: Mark reloadable parameters:

    • rebalancing.auto_rebalance: Yes
    • jwt_private_key: No (security risk)
    • partition_count: No (requires migration)
  2. SIGHUP Handler: Reload config on signal

    kill -HUP $(pgrep prism-proxy)
  3. Config Versioning: Track config version, log changes

    [INFO] Config reloaded: version 5 → 6
    [INFO] Changed: rebalancing.auto_rebalance: false → true

5.3 Per-Namespace Configuration Overrides

Problem: Can individual namespaces override global settings?

Example Use Case: Critical namespace needs longer lease TTL.

Current State: RFC-047 line 258 shows lease_ttl in request, but enforcement unclear.

Recommendations:

  1. Override Policy: Define which settings can be overridden per-namespace
  2. Permission Required: Only Advanced/Expert teams can override
  3. Limits: Max override: 2x global setting
  4. Audit: Log all overrides for compliance

5.4 Configuration Drift Detection

Problem: Proxy config differs from admin config.

Scenario:

  1. Operator updates proxy config file manually
  2. Admin pushes different configuration
  3. Proxy has conflicting settings

Recommendations:

  1. Config Source Priority: Admin > local file > defaults
  2. Diff Detection: Proxy logs config mismatches
  3. Reconciliation: Admin can query proxy config, detect drift
  4. Enforcement Mode: Proxy rejects local config if admin available

Configurability Score: 5/10

Rationale: Too many parameters without guidance, unclear hot-reload behavior.

6. Missing Patterns and Practices

6.1 Observability

Missing Metrics:

# Namespace operations
prism_namespace_reservations_total{status="success|error"}
prism_namespace_lease_refreshes_total{status="success|error"}
prism_namespace_lease_expirations_total
prism_namespace_active_leases

# Partition distribution
prism_partition_assignments_per_proxy{proxy_id}
prism_partition_rebalances_total{strategy}
prism_partition_rebalance_duration_seconds
prism_partition_imbalance_ratio

# Forwarding
prism_forwarding_requests_total{from_proxy, to_proxy, status}
prism_forwarding_latency_seconds{from_proxy, to_proxy, quantile}
prism_forwarding_failures_total{reason}

# Admin plane
prism_admin_grpc_requests_total{method, status}
prism_admin_partition_table_version
prism_admin_proxy_count{status="healthy|stale|disconnected"}

Missing Traces:

Namespace Reservation Trace:
  span: reserve_namespace
    span: validate_name
    span: check_uniqueness
    span: generate_jwt
    span: persist_namespace
    span: assign_partition
    span: notify_proxy

Partition Rebalance Trace:
  span: rebalance_partitions
    span: calculate_moves
    span: prepare_target_proxy
    span: update_partition_table
    span: distribute_assignments
    span: activate_target
    span: drain_source

Missing Logs:

Structured Logging Fields:
  namespace: string
  partition_id: int
  proxy_id: string
  lease_id: string
  operation: string
  actor: string
  result: success|error
  latency_ms: float
  error_code: string

Recommendations:

  1. Observability RFC: Dedicated RFC for metrics, traces, logs
  2. OpenTelemetry: Use OTEL for unified observability
  3. Dashboards: Pre-built Grafana dashboards for namespace operations
  4. Alerting: Sample alert rules for production

6.2 Disaster Recovery

Missing:

  • Backup strategy for namespace registry (SQLite)
  • Point-in-time recovery
  • Cross-region namespace replication
  • Namespace export/import

Recommendations:

  1. Continuous Backup: Replicate SQLite to S3 every 5 minutes
  2. Snapshots: Daily namespace registry snapshots
  3. Export Tool: prismctl namespace export --all > backup.json
  4. Restore Procedure: Documented recovery steps

6.3 Migration Paths

Missing:

  • Standalone → Coordinated mode migration
  • Single proxy → Multi-proxy migration
  • Partition strategy change migration

Recommendations:

  1. Migration ADR: Document migration procedures
  2. Zero-Downtime Migration: Dual-write pattern during transition
  3. Rollback Plan: Ability to revert migration
  4. Migration Tool: prismctl migrate standalone-to-coordinated

6.4 Testing Strategy

Missing (conflicts with ADR-004):

  • How to test 3-node admin Raft cluster locally?
  • How to test JWT lease management?
  • How to test partition rebalancing across 3 proxies?
  • How to test admin leader election and failover?
  • How to test split-brain scenarios?

Recommendations:

  1. Local Multi-Node Docker Compose (ADR-004 compliant):

    # docker-compose.test-multi-proxy.yml
    version: '3.9'

    services:
      # 3-node Raft cluster for prism-admin
      admin-1:
        image: prism-admin:dev
        environment:
          RAFT_NODE_ID: 1
          RAFT_CLUSTER_PEERS: admin-1:8981,admin-2:8982,admin-3:8983
          RAFT_DATA_DIR: /data/raft
        ports:
          - "8981:8981"
        healthcheck:
          test: ["CMD", "grpc-health-probe", "-addr=:8981"]
          interval: 5s

      admin-2:
        image: prism-admin:dev
        environment:
          RAFT_NODE_ID: 2
          RAFT_CLUSTER_PEERS: admin-1:8981,admin-2:8982,admin-3:8983
          RAFT_DATA_DIR: /data/raft
        ports:
          - "8982:8981"
        healthcheck:
          test: ["CMD", "grpc-health-probe", "-addr=:8981"]
          interval: 5s

      admin-3:
        image: prism-admin:dev
        environment:
          RAFT_NODE_ID: 3
          RAFT_CLUSTER_PEERS: admin-1:8981,admin-2:8982,admin-3:8983
          RAFT_DATA_DIR: /data/raft
        ports:
          - "8983:8981"
        healthcheck:
          test: ["CMD", "grpc-health-probe", "-addr=:8981"]
          interval: 5s

      # 3 proxy instances
      proxy-a:
        image: prism-proxy:dev
        environment:
          PROXY_ID: proxy-a
          ADMIN_ENDPOINTS: admin-1:8981,admin-2:8982,admin-3:8983
        ports:
          - "8980:8980"
        depends_on:
          - admin-1
          - admin-2
          - admin-3

      proxy-b:
        image: prism-proxy:dev
        environment:
          PROXY_ID: proxy-b
          ADMIN_ENDPOINTS: admin-1:8981,admin-2:8982,admin-3:8983
        ports:
          - "8990:8980"
        depends_on:
          - admin-1
          - admin-2
          - admin-3

      proxy-c:
        image: prism-proxy:dev
        environment:
          PROXY_ID: proxy-c
          ADMIN_ENDPOINTS: admin-1:8981,admin-2:8982,admin-3:8983
        ports:
          - "9000:8980"
        depends_on:
          - admin-1
          - admin-2
          - admin-3
  2. Test Helpers:

    #[tokio::test]
    async fn test_lease_expiration() {
        let mut time = MockTime::new();
        let admin = TestAdmin::new(&time);
        let proxy = TestProxy::new(&admin);

        let ns = proxy.reserve_namespace("test").await.unwrap();
        time.advance(Duration::from_secs(25 * 60 * 60)); // 25h, past the 24h expiration

        let result = proxy.configure_namespace("test", ns.token).await;
        assert!(result.is_err()); // Token expired
    }

    #[tokio::test]
    async fn test_admin_leader_failover() {
        // Start 3-node admin cluster
        let cluster = TestAdminCluster::new(3).await;

        // Kill leader
        let leader = cluster.current_leader();
        cluster.kill_node(leader).await;

        // Verify new leader elected within 5s
        let new_leader = cluster.wait_for_leader(Duration::from_secs(5)).await.unwrap();
        assert_ne!(new_leader, leader);

        // Verify proxies still work
        let proxy = TestProxy::connect_to_cluster(&cluster).await;
        let ns = proxy.reserve_namespace("test").await.unwrap();
        assert!(ns.success);
    }
  3. Chaos Testing Framework:

    # Kill admin leader, verify election
    prism-test chaos kill-leader --cluster admin --wait-election

    # Network partition: split admin cluster 2-1
    prism-test chaos partition --nodes admin-1,admin-2 --isolate admin-3 --duration 30s

    # Kill proxy, verify namespace reassignment
    prism-test chaos kill-proxy proxy-a --verify-rebalance
  4. Integration Test Suite:

    • Namespace reservation across 3 proxies
    • Partition rebalancing when proxy added/removed
    • Admin leader election and failover
    • Namespace advertisement to all proxies
    • Request forwarding between proxies
    • JWT token refresh and expiration
  5. Load Tests: Measure rebalancing impact with 1000 RPS load

6.5 Security

Missing:

  • mTLS between proxy and admin?
  • JWT signing key rotation?
  • Audit log encryption?
  • Namespace access logging?

Recommendations:

  1. mTLS: Mandatory mutual TLS for proxy-admin communication
  2. Key Rotation: Automate JWT signing key rotation every 90 days
  3. Audit Encryption: Encrypt audit logs at rest
  4. Access Logs: Log all namespace access with principal identity

6.6 Quota and Rate Limiting

Missing:

  • Rate limit for ReserveNamespace (prevent DoS)
  • Namespace quota per team
  • Partition assignment fairness

Recommendations:

  1. Rate Limits:

    rate_limits:
      reserve_namespace: 10/minute per proxy
      refresh_lease: 100/minute per namespace
      partition_table_refresh: 1/second per proxy
  2. Quotas (integrates with ADR-002 authorization boundaries):

    teams:
      - name: user-platform-team
        quotas:
          max_namespaces: 50
          max_partitions: 100
  3. Fairness: Prevent single team from monopolizing partitions

6.7 Documentation

Missing:

  • Runbooks for common operational tasks
  • Troubleshooting guide
  • Performance tuning guide
  • Security best practices

Recommendations:

  1. Runbooks:

    • "Namespace won't reserve - troubleshooting"
    • "Partition rebalancing stuck - recovery"
    • "Admin plane down - emergency procedure"
  2. Decision Guides:

    • "Choosing partition strategy"
    • "Configuring lease TTL"
    • "When to enable auto-rebalancing"
  3. Operational Playbooks:

    • "Adding new proxy to fleet"
    • "Decommissioning proxy"
    • "Emergency namespace recovery"

7. Recommendations

Priority 1: Critical (Block Implementation)

  1. Define Namespace Advertisement Protocol:

    • Add NamespaceAdvertisement protobuf message to proto/prism/admin/v1/namespace.proto
    • Include: namespace, partition_id, proxy_list, runner_list, config, partition_strategy
    • Specify push mechanism from admin to all proxies
    • Define caching strategy in proxy
  2. Admin HA Architecture:

    • Design 3-node Raft cluster for prism-admin (recommend hashicorp/raft)
    • Proxy does NOT run Raft (discovers peers through namespace advertisements)
    • Define failure modes and recovery procedures
    • Implement circuit breaker in proxy for admin communication
    • Admin endpoint list in proxy config: admin-1:8981,admin-2:8982,admin-3:8983
  3. Load Balancer Integration Guide:

    • Standardize X-Prism-Namespace header requirement for all client requests
    • Provide HAProxy, Envoy, and NGINX configuration examples
    • Document session affinity strategy (consistent hashing on namespace)
    • Specify fallback forwarding for failover scenarios
    • Benchmark actual forwarding overhead (validate 1-2ms assumption)
  4. Testing Strategy (ADR-004 compliant):

    • Create docker-compose with 3 admin nodes (Raft) + 3 proxies
    • Test helpers for namespace advertisement flow
    • Chaos tests: admin leader election, network partition, proxy failure
    • Integration tests: namespace reservation, partition rebalancing, request forwarding

Priority 2: High (Before Production)

  1. Partition Strategy Guidance:

    • Keep all three strategies (consistent hashing, key range, explicit)
    • Provide decision tree for strategy selection
    • Require ADR-002 "Advanced" permission for key range and explicit strategies
    • Default to consistent hashing for all deployments
    • Document use cases: multi-tenant SaaS (key range), resource isolation (explicit)
  2. Merge RFCs:

    • Combine RFC-047 and RFC-048 into single comprehensive RFC
    • Create unified namespace lifecycle diagram
    • Clarify state synchronization between reservation and partitioning
    • Add NamespaceAdvertisement as core protocol element
  3. Configuration Profiles:

    • Provide development, staging, production configurations
    • Document all parameters with safe defaults
    • Implement startup validation
  4. Observability:

    • Define complete metrics taxonomy
    • Implement distributed tracing
    • Create operational dashboards
  5. Error Handling:

    • Create error catalog with recovery actions
    • Implement graceful degradation
    • Define retry policies

Priority 3: Medium (Post-MVP)

  1. Advanced Features:

    • Add standalone mode (if strong use case emerges)
    • Implement advanced forwarding modes (redirect, client-side routing)
    • Partition strategy migration tool
  2. Migration Tools:

    • Build migration utilities (standalone → coordinated, strategy changes)
    • Document upgrade procedures
    • Test rollback scenarios
  3. Security Hardening:

    • Implement mTLS between proxy and admin
    • Add JWT signing key rotation (every 90 days)
    • Encrypt audit logs at rest

Priority 4: Low (Future Enhancements)

  1. Optimization:

    • Advanced load balancer integration (client-side partition calculation)
    • Connection pooling optimizations
    • Partition count scaling (>256)
  2. Advanced Operations:

    • Cross-region namespace replication
    • Namespace export/import tools
    • Automated capacity planning based on metrics

8. Alignment with Core Goals

Simplicity: 5/10

Current: Multiple modes and strategies, but all have valid use cases.

Target: Clear default path (coordinated + consistent hashing) with documented advanced options.

Actions:

  • Start with coordinated mode only (add standalone later if needed)
  • Default to consistent hashing, require "Advanced" permission for other strategies
  • Provide production-ready configuration templates with safe defaults

Reliability: 6/10

Current: Admin HA architecture specified, but NamespaceAdvertisement protocol missing.

Target: 3-node Raft admin cluster, graceful degradation, comprehensive error handling.

Actions:

  • Define NamespaceAdvertisement protobuf message
  • Implement 3-node Raft cluster for prism-admin (hashicorp/raft)
  • Define all failure modes and recovery procedures
  • Add circuit breakers and fallback logic

Robustness: 6/10

Current: Performance acceptable with load balancer, but consistency mechanisms underspecified.

Target: Documented load balancer integration, well-defined consistency guarantees.

Actions:

  • Create load balancer integration guide (HAProxy, Envoy, NGINX)
  • Benchmark actual forwarding latency (validate 1-2ms assumption)
  • Implement partition table versioning
  • Define consistency levels explicitly

Comprehensibility: 5/10

Current: Concepts span multiple documents, lack integration.

Target: Single cohesive design, clear documentation, examples.

Actions:

  • Merge RFCs into one document
  • Create visual diagrams
  • Provide end-to-end examples

Configurability: 6/10

Current: Many parameters without guidance.

Target: Safe defaults, clear override policies, runtime reconfiguration.

Actions:

  • Define configuration profiles
  • Mark hot-reloadable settings
  • Implement validation

9. Conclusion

RFC-047 and RFC-048 provide a solid foundation for multi-proxy namespace coordination, but significant gaps must be addressed before implementation:

  1. Simplify: Establish one clear default path (coordinated mode, consistent hashing) and gate other modes and strategies behind Advanced permissions
  2. Harden: Add admin HA and comprehensive error handling
  3. Validate: Benchmark performance claims
  4. Test: Build local testing framework
  5. Document: Merge RFCs, create operational guides

Recommendation: Refactor RFCs to address Priority 1 and Priority 2 items before proceeding with implementation. Focus on getting the "happy path" rock-solid before adding advanced features.

Estimated Effort: 2-3 weeks to address critical gaps, 4-6 weeks for full refinement.

Next Steps:

  1. Team review of this memo
  2. Prioritize gap remediation work
  3. Create focused follow-up RFCs for Priority 1 items
  4. Update RFC-047 and RFC-048 with clarifications

Revision History

  • 2025-10-25: Initial gap analysis - comprehensive review of RFC-047 and RFC-048