Tags: mailbox, lease, lifecycle, routing, pattern-runner, admin-plane, ttl, raft

Status: Proposed
Author: Platform Team
Created: Oct 27, 2025
Updated: Oct 27, 2025

RFC-049: Mailbox Lifecycle, Lease Management, and Routing Coordination

Abstract

This RFC specifies lifecycle management, lease coordination, and routing for the Mailbox Pattern (RFC-037). Pattern runners must establish both a routable identity and a TTL-based lease for their mailbox instances. The RFC compares three lease backend options (etcd, Redis, raft-based admin) and recommends raft-based admin storage to simplify deployment and align with project philosophy. Pattern runners hold leases via the admin plane and report their availability through the session protocol, and the admin plane distributes the resulting routes so that proxies can forward mailbox queries to the correct backend.

Key Benefits:

Simplified Deployment: No external etcd or lease infrastructure required
Built-In Lease Management: Raft consensus in admin plane provides distributed coordination
Routable Identity: Each mailbox has discoverable address for query routing
TTL-Based Lifecycle: Automatic cleanup of abandoned mailboxes
Cross-Proxy Coordination: Admin plane shares mailbox routing info with all proxies
Aligned with Project Philosophy: Local-first testing, minimal dependencies (ADR-004)

Motivation

Problem Statement

The Mailbox Pattern (RFC-037) stores events in searchable storage but lacks lifecycle management:

Problem 1: No Routable Identity

Clients don't know which backend instance holds their mailbox
Queries may target wrong proxy/pattern-runner
No service discovery for mailbox locations

Problem 2: No TTL Management

Mailboxes created but never queried remain indefinitely
No mechanism to expire inactive mailboxes
Resource waste from abandoned storage

Problem 3: No Lease Coordination

Pattern runners don't hold leases for their mailbox assignments
Multiple runners could claim same mailbox (split-brain)
No graceful shutdown mechanism

Problem 4: No Routing Infrastructure

Proxies don't know which proxy/runner owns a mailbox
Cannot forward mailbox queries to correct backend
Mailbox query requests fail or route incorrectly

Problem 5: Pattern Runner Backend Dependencies

Pattern runners need distributed coordination
Options: etcd, Redis, raft-based admin
Unclear which backend aligns with project goals

Goals

Routable Identity: Each mailbox has a discoverable identity ({namespace}@{proxy_id}) that resolves to a route (proxy_id, runner_id, partition)
TTL-Based Leases: Pattern runners hold leases for mailbox assignments with automatic expiration
Lease Backend Selection: Choose lease coordination backend aligned with project philosophy
Session Protocol: Pattern runners communicate lifecycle (startup, heartbeat, shutdown) to admin
Routing Coordination: Admin plane distributes mailbox routing table to all proxies
Query Forwarding: Proxies forward mailbox queries to owning pattern runner
Graceful Shutdown: Pattern runners release leases before terminating

Non-Goals

Data Migration: Moving mailbox data between storage backends
Multi-Region Leases: Cross-cluster lease coordination (see RFC-012)
Backend-Level TTL: Backend-specific expiration (SQLite VACUUM, etc.)
Lease Renewal Strategies: Advanced lease renewal algorithms (constant, backoff, etc.)

Lease Backend Comparison

Option 1: etcd for Lease Management

Architecture:

┌─────────────┐
│ Pattern     │──lease──▶ etcd cluster
│ Runner      │◀─watch───  (distributed KV + lease)
└─────────────┘

Pros:

✅ Purpose-built for distributed coordination
✅ Mature lease management with TTL
✅ Watch API for lease changes
✅ Strong consistency (raft consensus)
✅ Well-documented, battle-tested

Cons:

❌ Additional infrastructure dependency
❌ Complex deployment (3-5 node cluster)
❌ Conflicts with ADR-004 (local-first testing)
❌ Operational burden (monitoring, backups, upgrades)
❌ Network hop for every lease operation
❌ Requires etcd client library in pattern runners

Deployment Complexity: HIGH

# Requires separate etcd cluster
etcd --name etcd-01 --initial-cluster etcd-01=http://...
etcd --name etcd-02 --initial-cluster etcd-02=http://...
etcd --name etcd-03 --initial-cluster etcd-03=http://...

Option 2: Redis for Lease Management

Architecture:

┌─────────────┐
│ Pattern     │──SETEX──▶ Redis
│ Runner      │◀─PUBSUB── (key expiration + pub/sub)
└─────────────┘

Pros:

✅ Simple key expiration (SETEX, EXPIRE)
✅ Pub/sub for lease events
✅ High performance (in-memory)
✅ Already used as backend in Prism
✅ Lightweight single-node deployment

Cons:

❌ Not designed for distributed coordination
❌ No strong consistency guarantees
❌ Key expiration timing not precise
❌ Redis Sentinel/Cluster adds complexity
❌ Split-brain risk without consensus
❌ Additional dependency for lease management
❌ Conflicts with minimal dependencies goal

Deployment Complexity: MEDIUM

# Requires Redis instance (single-node or Sentinel)
redis-server --port 6379
# OR with Sentinel for HA (3+ nodes)

Option 3: Raft-Based Admin Plane (Recommended)

Architecture:

┌─────────────┐
│ Pattern     │──gRPC──▶ prism-admin (raft consensus)
│ Runner      │◀─stream── (control plane protocol)
└─────────────┘           └─▶ SQLite + raft log

Pros:

✅ Zero additional infrastructure (uses existing admin plane)
✅ Raft consensus already in admin for namespace coordination (RFC-047)
✅ Control plane protocol already defined (ADR-055)
✅ Local-first testing (single admin process) (ADR-004)
✅ SQLite storage for lease state (ADR-054)
✅ Heartbeat mechanism already implemented for proxy registration
✅ Aligned with project philosophy (minimal dependencies)
✅ Built-in function of proxy/pattern/admin relationship
✅ Unified monitoring (admin plane metrics cover leases)

Cons:

⚠️ Admin plane becomes critical path for pattern runner startup
⚠️ Lease coordination coupled to admin plane availability
⚠️ Need to implement lease expiration logic in admin

Deployment Complexity: LOW

# No additional infrastructure (admin already runs)
prism-admin --storage-path /data/admin.db
# Pattern runners connect via existing control plane
prism-runner --admin-endpoint admin.prism.local:8981

Decision Matrix

Criteria              | etcd       | Redis     | Raft Admin        | Winner
Infrastructure        | 3-5 nodes  | 1-3 nodes | 0 new nodes       | ✅ Raft Admin
Deployment Complexity | High       | Medium    | Low               | ✅ Raft Admin
Consistency           | Strong     | Weak      | Strong            | ✅ etcd/Raft Admin
Local-First Testing   | ❌ Cluster | ⚠️ Redis  | ✅ Single process | ✅ Raft Admin
Integration Effort    | New client | New client| Existing protocol | ✅ Raft Admin
Operational Burden    | High       | Medium    | Low               | ✅ Raft Admin
Project Alignment     | ❌ Complex | ⚠️ OK     | ✅ Perfect        | ✅ Raft Admin
Performance           | ~10ms      | ~1ms      | ~5ms              | Redis (but not critical)

Recommendation: Raft-Based Admin Plane

Use raft consensus in prism-admin for mailbox lease management.

Rationale:

Zero New Dependencies: Leverages existing admin plane infrastructure
Simplified Deployment: No separate etcd/Redis cluster to manage
Local-First Testing: Single admin process works for dev/test (ADR-004)
Consistent Architecture: Same mechanism for namespace leases (RFC-047) and mailbox leases
Strong Consistency: Raft provides same guarantees as etcd without operational overhead
Built-In Function: Makes lease management intrinsic to proxy/pattern/admin relationship

Trade-Offs Accepted:

Admin plane becomes critical path (already true for namespace coordination)
Lease coordination performance slightly slower than Redis (~5ms vs ~1ms)
Need to implement lease expiration logic (one-time engineering cost)

Architecture

System Components

┌───────────────────────────────────────────────────────────────────┐
│                         Client Applications                       │
└──────────────────────────┬────────────────────────────────────────┘
                           │ Query mailbox
                           │
┌──────────────────────────▼────────────────────────────────────────┐
│                        Prism Proxy Fleet                          │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐         │
│  │ Proxy-01     │   │ Proxy-02     │   │ Proxy-03     │         │
│  │ (us-east-1a) │   │ (us-east-1b) │   │ (us-west-2a) │         │
│  └──────┬───────┘   └──────┬───────┘   └──────┬───────┘         │
│         │                  │                  │                   │
│         │ Mailbox routing table (from admin)  │                   │
│         │                  │                  │                   │
│  ┌──────▼──────────────────▼──────────────────▼──────┐           │
│  │ Pattern Runners (Mailbox)                         │           │
│  │  ┌───────────────┐  ┌───────────────┐            │           │
│  │  │ mailbox-01    │  │ mailbox-02    │            │           │
│  │  │ (partition 0) │  │ (partition 1) │            │           │
│  │  └───┬───────────┘  └───┬───────────┘            │           │
│  │      │                  │                         │           │
│  │      │ Lease heartbeat  │                         │           │
│  └──────┼──────────────────┼─────────────────────────┘           │
└─────────┼──────────────────┼─────────────────────────────────────┘
          │                  │
          │                  │ gRPC control plane
          │                  │
┌─────────▼──────────────────▼─────────────────────────────────────┐
│                    Prism Admin Plane                              │
│                    (Raft Consensus)                               │
│                                                                    │
│  ┌─────────────────────────────────────────────────────────┐     │
│  │ Mailbox Lease Registry                                  │     │
│  │ ┌─────────────────────────────────────────────────────┐ │     │
│  │ │ mailbox_id         | runner_id  | lease_expires     │ │     │
│  │ │ $admin@proxy-01    | runner-01  | 2025-10-27T10:35  │ │     │
│  │ │ audit-logs@proxy-02| runner-02  | 2025-10-27T10:36  │ │     │
│  │ └─────────────────────────────────────────────────────┘ │     │
│  └─────────────────────────────────────────────────────────┘     │
│                                                                    │
│  ┌─────────────────────────────────────────────────────────┐     │
│  │ Mailbox Routing Table (distributed to proxies)         │     │
│  │ ┌─────────────────────────────────────────────────────┐ │     │
│  │ │ mailbox_id         | proxy_id   | runner_id | addr  │ │     │
│  │ │ $admin@proxy-01    | proxy-01   | runner-01 | :8982 │ │     │
│  │ │ audit-logs@proxy-02| proxy-02   | runner-02 | :8983 │ │     │
│  │ └─────────────────────────────────────────────────────┘ │     │
│  └─────────────────────────────────────────────────────────┘     │
│                                                                    │
│  SQLite Storage (ADR-054) + Raft Log                              │
└────────────────────────────────────────────────────────────────────┘

Routable Identity Format

Each mailbox has a unique routable identity:

mailbox_id = "{namespace}@{proxy_id}"

Examples:
  $admin@proxy-01              → Admin mailbox on proxy-01
  audit-logs@proxy-02          → Audit mailbox on proxy-02
  user-events@proxy-01         → User events mailbox on proxy-01

Components:

namespace: Mailbox namespace (e.g., $admin, audit-logs)
proxy_id: Proxy instance owning the mailbox (e.g., proxy-01)
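The identity format above can be parsed mechanically. The following is a minimal sketch, not part of the RFC: the MailboxIdentity type and ParseMailboxID function are hypothetical names, and splitting at the last '@' is an assumption about how ambiguous IDs should be handled.

```go
package main

import (
	"fmt"
	"strings"
)

// MailboxIdentity is a hypothetical parsed form of "{namespace}@{proxy_id}".
type MailboxIdentity struct {
	Namespace string
	ProxyID   string
}

// ParseMailboxID splits a mailbox ID at its last '@'. Using the last
// separator is an assumption: it tolerates a namespace that contains '@'
// while keeping proxy IDs simple.
func ParseMailboxID(id string) (MailboxIdentity, error) {
	i := strings.LastIndex(id, "@")
	if i <= 0 || i == len(id)-1 {
		return MailboxIdentity{}, fmt.Errorf("invalid mailbox id %q: want {namespace}@{proxy_id}", id)
	}
	return MailboxIdentity{Namespace: id[:i], ProxyID: id[i+1:]}, nil
}

// String renders the identity back into its routable form.
func (m MailboxIdentity) String() string {
	return m.Namespace + "@" + m.ProxyID
}
```

Rejecting IDs with an empty namespace or empty proxy ID at parse time keeps malformed identities out of the lease registry and the routing table.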

Routing Table Entry:

type MailboxRoute struct {
    MailboxID string // $admin@proxy-01
    ProxyID   string // proxy-01
    RunnerID  string // runner-01
    Address   string // proxy-01.prism.local:8982
    Partition int32  // 0-255 (from RFC-048)
}

Lease Structure

type MailboxLease struct {
    MailboxID     string    // $admin@proxy-01
    RunnerID      string    // runner-01
    ProxyID       string    // proxy-01
    Partition     int32     // Partition ID (0-255)
    LeaseExpires  time.Time // TTL expiration
    LastHeartbeat time.Time // Last heartbeat from runner
    Address       string    // runner gRPC address
    Status        string    // active, expiring, expired
}

Lease TTL Configuration:

namespaces:
  - name: $admin
    pattern: mailbox
    lease:
      ttl: 300s               # 5 minutes
      heartbeat_interval: 60s # 1 minute
      grace_period: 60s       # 1 minute after TTL
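The three lease settings interact, so it can help to validate them together when a namespace is loaded. The sketch below is illustrative only: LeasePolicy, its methods, and DefaultAdminPolicy are hypothetical names, and the rule that heartbeat_interval must be shorter than ttl is an assumption rather than something the RFC mandates.

```go
package main

import (
	"errors"
	"time"
)

// LeasePolicy mirrors the lease block of the namespace configuration.
// The type and its methods are hypothetical.
type LeasePolicy struct {
	TTL               time.Duration
	HeartbeatInterval time.Duration
	GracePeriod       time.Duration
}

// Validate rejects policies in which a single heartbeat interval would
// already consume the whole TTL.
func (p LeasePolicy) Validate() error {
	if p.TTL <= 0 || p.HeartbeatInterval <= 0 || p.GracePeriod < 0 {
		return errors.New("ttl and heartbeat_interval must be positive, grace_period non-negative")
	}
	if p.HeartbeatInterval >= p.TTL {
		return errors.New("heartbeat_interval must be shorter than ttl")
	}
	return nil
}

// HeartbeatsPerTTL is the number of heartbeat intervals that fit in one TTL.
func (p LeasePolicy) HeartbeatsPerTTL() int {
	return int(p.TTL / p.HeartbeatInterval)
}

// ReclaimAfter is how long after the last successful renewal the admin
// may reclaim the lease: TTL plus grace period.
func (p LeasePolicy) ReclaimAfter() time.Duration {
	return p.TTL + p.GracePeriod
}

// DefaultAdminPolicy returns the values from the $admin example above.
func DefaultAdminPolicy() LeasePolicy {
	return LeasePolicy{
		TTL:               300 * time.Second,
		HeartbeatInterval: 60 * time.Second,
		GracePeriod:       60 * time.Second,
	}
}
```

With the example values, five heartbeat intervals fit into one TTL, and a lease becomes reclaimable 360 seconds after its last successful renewal.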

Pattern Runner Lease Protocol

Startup: Acquire Lease

// Pattern runner connects to admin on startup
func (r *MailboxRunner) Start(ctx context.Context) error {
    // 1. Connect to admin control plane.
    // Plaintext credentials are an assumption for local development;
    // production deployments should supply TLS credentials instead.
    conn, err := grpc.Dial(r.adminEndpoint,
        grpc.WithTransportCredentials(insecure.NewCredentials()))
    if err != nil {
        return fmt.Errorf("failed to connect to admin: %w", err)
    }
    // Keep the client on the runner: heartbeat and shutdown reuse it
    r.adminClient = NewControlPlaneClient(conn)

    // 2. Acquire lease for mailbox
    req := &AcquireLeaseRequest{
        MailboxID: r.mailboxID, // $admin@proxy-01
        RunnerID:  r.runnerID,  // runner-01
        ProxyID:   r.proxyID,   // proxy-01
        Partition: r.partition, // 0
        Address:   r.address,   // proxy-01.prism.local:8982
        TTL:       300,         // 5 minutes
    }

    resp, err := r.adminClient.AcquireLease(ctx, req)
    if err != nil {
        return fmt.Errorf("failed to acquire lease: %w", err)
    }

    if !resp.Success {
        return fmt.Errorf("lease acquisition denied: %s", resp.Message)
    }

    // 3. Start heartbeat goroutine
    go r.heartbeatLoop(ctx)

    // 4. Start mailbox consumer
    go r.consumeMessages(ctx)

    log.Info("Mailbox runner started",
        "mailbox_id", r.mailboxID,
        "lease_expires", resp.LeaseExpires)

    return nil
}

Runtime: Heartbeat

func (r *MailboxRunner) heartbeatLoop(ctx context.Context) {
    ticker := time.NewTicker(60 * time.Second) // heartbeat_interval
    defer ticker.Stop()

    for {
        select {
        case <-ticker.C:
            err := r.sendHeartbeat(ctx)
            if err != nil {
                log.Error("Heartbeat failed", "error", err)
                // Retry with exponential backoff
            }

        case <-ctx.Done():
            log.Info("Heartbeat loop stopping")
            return
        }
    }
}

func (r *MailboxRunner) sendHeartbeat(ctx context.Context) error {
    req := &MailboxHeartbeatRequest{
        MailboxID:        r.mailboxID,
        RunnerID:         r.runnerID,
        ProxyID:          r.proxyID,
        EventsStored:     r.stats.TotalEvents,
        QueriesServed:    r.stats.TotalQueries,
        StorageSizeBytes: r.stats.StorageSize,
        Status:           "healthy",
    }

    resp, err := r.adminClient.MailboxHeartbeat(ctx, req)
    if err != nil {
        return fmt.Errorf("heartbeat RPC failed: %w", err)
    }

    if !resp.LeaseRenewed {
        // The admin no longer treats this runner as the lease owner:
        // stop serving and do not report the heartbeat as successful.
        log.Warn("Lease not renewed, shutting down",
            "reason", resp.Message)
        r.Shutdown(ctx)
        return fmt.Errorf("lease not renewed: %s", resp.Message)
    }

    log.Debug("Heartbeat sent",
        "lease_expires", resp.NewExpiration)

    return nil
}
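The heartbeat loop above leaves its retry policy as a comment. One possible delay schedule is sketched below as an illustration, not as the RFC's chosen strategy: the constants heartbeatRetryBase and heartbeatRetryMax and both function names are assumptions.

```go
package main

import "time"

// Hypothetical retry bounds. The total retry time should stay well
// below ttl + grace_period so that a recovered runner can still renew.
const (
	heartbeatRetryBase = 1 * time.Second
	heartbeatRetryMax  = 30 * time.Second
)

// backoffDelay returns the wait before retry number attempt (0-based),
// doubling from base on each attempt and capping at max.
func backoffDelay(attempt int, base, max time.Duration) time.Duration {
	if base <= 0 {
		return 0
	}
	d := base
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= max {
			// Returning early also avoids overflow for large attempts.
			return max
		}
	}
	if d > max {
		return max
	}
	return d
}

// heartbeatRetryDelay applies the hypothetical bounds above.
func heartbeatRetryDelay(attempt int) time.Duration {
	return backoffDelay(attempt, heartbeatRetryBase, heartbeatRetryMax)
}
```

With these bounds the delays run 1s, 2s, 4s, 8s, 16s and then stay at 30s, so six consecutive failures consume about a minute, roughly one heartbeat_interval of the default policy.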

Shutdown: Release Lease

func (r *MailboxRunner) Shutdown(ctx context.Context) error {
    log.Info("Shutting down mailbox runner", "mailbox_id", r.mailboxID)

    // 1. Stop accepting new messages
    r.consumer.Stop()

    // 2. Flush pending writes
    r.storage.Flush()

    // 3. Release lease
    req := &ReleaseLeaseRequest{
        MailboxID: r.mailboxID,
        RunnerID:  r.runnerID,
    }

    resp, err := r.adminClient.ReleaseLease(ctx, req)
    if err != nil {
        // resp is nil on error, so return before dereferencing it.
        // Shutdown continues anyway: the lease lapses on the admin side
        // once its TTL and grace period pass.
        log.Error("Failed to release lease", "error", err)
        return nil
    }

    log.Info("Lease released",
        "mailbox_id", r.mailboxID,
        "released", resp.Released)

    return nil
}

Admin Plane Lease Management

Lease Acquisition

func (a *AdminPlane) AcquireLease(ctx context.Context, req *AcquireLeaseRequest) (*AcquireLeaseResponse, error) {
    // 1. Check if lease already exists
    existingLease, err := a.storage.GetLease(req.MailboxID)
    if err != nil && !errors.Is(err, ErrNotFound) {
        return nil, fmt.Errorf("failed to check lease: %w", err)
    }

    // 2. If lease exists and not expired, deny
    if existingLease != nil && time.Now().Before(existingLease.LeaseExpires) {
        if existingLease.RunnerID != req.RunnerID {
            return &AcquireLeaseResponse{
                Success: false,
                Message: fmt.Sprintf("mailbox already leased by %s", existingLease.RunnerID),
            }, nil
        }
        // Same runner re-acquiring (reconnect case)
    }

    // 3. Create new lease
    lease := &MailboxLease{
        MailboxID:     req.MailboxID,
        RunnerID:      req.RunnerID,
        ProxyID:       req.ProxyID,
        Partition:     req.Partition,
        LeaseExpires:  time.Now().Add(time.Duration(req.TTL) * time.Second),
        LastHeartbeat: time.Now(),
        Address:       req.Address,
        Status:        "active",
    }

    // 4. Store lease via raft consensus
    err = a.raftApply(ctx, &LeaseOperation{
        Type:  "acquire",
        Lease: lease,
    })
    if err != nil {
        return nil, fmt.Errorf("raft apply failed: %w", err)
    }

    // 5. Update routing table and distribute to proxies
    a.updateRoutingTable(lease)
    a.broadcastRoutingUpdate(lease)

    log.Info("Lease acquired",
        "mailbox_id", lease.MailboxID,
        "runner_id", lease.RunnerID,
        "expires", lease.LeaseExpires)

    return &AcquireLeaseResponse{
        Success:      true,
        LeaseExpires: lease.LeaseExpires,
    }, nil
}
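The grant-or-deny decision in step 2 can be isolated into a pure function, which makes it easy to unit test without raft or storage. This is a sketch under assumptions: leaseRecord and decideAcquire are hypothetical names, and, like the handler above, the sketch does not consider the grace period.

```go
package main

import "time"

// leaseRecord holds only the fields the decision needs (hypothetical type).
type leaseRecord struct {
	RunnerID     string
	LeaseExpires time.Time
}

// decideAcquire reports whether runnerID may take the lease at time now,
// plus a human-readable reason. It follows the handler above: an
// unexpired lease blocks every runner except its current owner.
func decideAcquire(existing *leaseRecord, runnerID string, now time.Time) (bool, string) {
	switch {
	case existing == nil:
		return true, "no existing lease"
	case !now.Before(existing.LeaseExpires):
		return true, "existing lease expired"
	case existing.RunnerID == runnerID:
		return true, "same runner re-acquiring"
	default:
		return false, "mailbox already leased by " + existing.RunnerID
	}
}
```

Passing now as a parameter, rather than calling time.Now() inside, lets tests exercise the expiry boundary deterministically.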

Heartbeat Processing

// Handler for the MailboxHeartbeat RPC defined in the ControlPlane service
func (a *AdminPlane) MailboxHeartbeat(ctx context.Context, req *MailboxHeartbeatRequest) (*MailboxHeartbeatResponse, error) {
    // 1. Lookup lease
    lease, err := a.storage.GetLease(req.MailboxID)
    if errors.Is(err, ErrNotFound) {
        return &MailboxHeartbeatResponse{
            LeaseRenewed: false,
            Message:      "lease not found",
        }, nil
    }
    if err != nil {
        // A storage failure is not a lost lease: return an RPC error so
        // the runner retries instead of shutting down
        return nil, fmt.Errorf("failed to load lease: %w", err)
    }

    // 2. Verify runner owns lease
    if lease.RunnerID != req.RunnerID {
        return &MailboxHeartbeatResponse{
            LeaseRenewed: false,
            Message:      fmt.Sprintf("lease owned by %s", lease.RunnerID),
        }, nil
    }

    // 3. Check if lease expired (beyond grace period)
    gracePeriod := 60 * time.Second // grace_period from lease config
    if time.Now().After(lease.LeaseExpires.Add(gracePeriod)) {
        return &MailboxHeartbeatResponse{
            LeaseRenewed: false,
            Message:      "lease expired beyond grace period",
        }, nil
    }

    // 4. Renew lease
    lease.LeaseExpires = time.Now().Add(300 * time.Second) // ttl from lease config
    lease.LastHeartbeat = time.Now()
    lease.Status = "active"

    // 5. Apply via raft
    err = a.raftApply(ctx, &LeaseOperation{
        Type:  "renew",
        Lease: lease,
    })
    if err != nil {
        return nil, fmt.Errorf("raft apply failed: %w", err)
    }

    log.Debug("Lease renewed",
        "mailbox_id", lease.MailboxID,
        "new_expiration", lease.LeaseExpires)

    return &MailboxHeartbeatResponse{
        LeaseRenewed:  true,
        NewExpiration: lease.LeaseExpires,
    }, nil
}

Lease Expiration Background Job

func (a *AdminPlane) leaseExpirationLoop(ctx context.Context) {
    ticker := time.NewTicker(30 * time.Second)
    defer ticker.Stop()

    for {
        select {
        case <-ticker.C:
            a.expireStaleLeases(ctx)

        case <-ctx.Done():
            return
        }
    }
}

func (a *AdminPlane) expireStaleLeases(ctx context.Context) {
    leases, err := a.storage.GetAllLeases()
    if err != nil {
        log.Error("Failed to fetch leases", "error", err)
        return
    }

    now := time.Now()
    gracePeriod := 60 * time.Second

    for _, lease := range leases {
        if now.After(lease.LeaseExpires.Add(gracePeriod)) {
            log.Warn("Expiring stale lease",
                "mailbox_id", lease.MailboxID,
                "runner_id", lease.RunnerID,
                "last_heartbeat", lease.LastHeartbeat)

            // Mark as expired via raft
            lease.Status = "expired"
            err := a.raftApply(ctx, &LeaseOperation{
                Type:  "expire",
                Lease: lease,
            })
            if err != nil {
                log.Error("Failed to expire lease", "error", err)
                continue
            }

            // Remove from routing table
            a.removeFromRoutingTable(lease.MailboxID)
            a.broadcastRoutingRemoval(lease.MailboxID)
        }
    }
}
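The lease Status field lists three values (active, expiring, expired), while the expiration job above only ever writes "expired". One way the three states could be derived from the timestamps is sketched below. Treating "expiring" as the window between TTL expiry and the end of the grace period is an assumption, and leaseWindow is a hypothetical type.

```go
package main

import "time"

const (
	statusActive   = "active"
	statusExpiring = "expiring"
	statusExpired  = "expired"
)

// leaseWindow carries the two values that determine a lease's status.
type leaseWindow struct {
	LeaseExpires time.Time
	Grace        time.Duration
}

// Status classifies the lease at time now:
//   - active:   the TTL has not passed yet
//   - expiring: the TTL has passed, but the grace period has not
//   - expired:  the grace period has passed (same test as the job above)
func (w leaseWindow) Status(now time.Time) string {
	switch {
	case !now.After(w.LeaseExpires):
		return statusActive
	case !now.After(w.LeaseExpires.Add(w.Grace)):
		return statusExpiring
	default:
		return statusExpired
	}
}
```

Because the expiration loop ticks every 30 seconds, a lease can remain in the registry for up to one tick after it first classifies as expired.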

Routing Coordination

Routing Table Distribution

// Admin broadcasts routing updates to all proxies
func (a *AdminPlane) broadcastRoutingUpdate(lease *MailboxLease) {
    route := &MailboxRoute{
        MailboxID: lease.MailboxID,
        ProxyID:   lease.ProxyID,
        RunnerID:  lease.RunnerID,
        Address:   lease.Address,
        Partition: lease.Partition,
    }

    // Send to all registered proxies
    for _, proxy := range a.proxies.GetAll() {
        go func(p *ProxyRegistration) {
            err := p.client.UpdateMailboxRoute(context.Background(), route)
            if err != nil {
                log.Error("Failed to update routing",
                    "proxy_id", p.ProxyID,
                    "mailbox_id", route.MailboxID,
                    "error", err)
            }
        }(proxy)
    }
}
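The broadcast above uses context.Background() with no deadline, so an unreachable proxy can hold its goroutine open indefinitely. The sketch below bounds each push with a timeout and reports which proxies failed. It is illustrative only: routePusher, broadcastRoutes, and the two stand-in functions are hypothetical and do not correspond to the real proxy client.

```go
package main

import (
	"context"
	"sort"
	"sync"
	"time"
)

// routePusher is a hypothetical stand-in for the per-proxy RPC call.
type routePusher func(ctx context.Context) error

// broadcastRoutes pushes to every proxy concurrently, gives each call
// its own timeout, waits for all of them, and returns the sorted IDs
// of the proxies whose push failed.
func broadcastRoutes(proxies map[string]routePusher, timeout time.Duration) []string {
	var (
		mu     sync.Mutex
		wg     sync.WaitGroup
		failed []string
	)
	for id, push := range proxies {
		wg.Add(1)
		go func(id string, push routePusher) {
			defer wg.Done()
			ctx, cancel := context.WithTimeout(context.Background(), timeout)
			defer cancel()
			if err := push(ctx); err != nil {
				mu.Lock()
				failed = append(failed, id)
				mu.Unlock()
			}
		}(id, push)
	}
	wg.Wait()
	sort.Strings(failed)
	return failed
}

// Stand-ins used only to exercise the sketch.
func reachable(ctx context.Context) error { return nil }

// unreachable blocks until the per-call timeout fires.
func unreachable(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}
```

Returning the failed proxy IDs gives the admin a list to retry or to flag as holding a stale routing table.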

Proxy Routing Table

type ProxyRoutingTable struct {
    mu     sync.RWMutex
    routes map[string]*MailboxRoute // mailbox_id -> route
}

func (p *Proxy) UpdateMailboxRoute(ctx context.Context, route *MailboxRoute) error {
    p.routingTable.mu.Lock()
    defer p.routingTable.mu.Unlock()

    p.routingTable.routes[route.MailboxID] = route

    log.Info("Routing table updated",
        "mailbox_id", route.MailboxID,
        "proxy_id", route.ProxyID,
        "address", route.Address)

    return nil
}

func (p *Proxy) RouteMailboxQuery(ctx context.Context, mailboxID string) (*MailboxRoute, error) {
    p.routingTable.mu.RLock()
    defer p.routingTable.mu.RUnlock()

    route, exists := p.routingTable.routes[mailboxID]
    if !exists {
        return nil, fmt.Errorf("mailbox not found: %s", mailboxID)
    }

    return route, nil
}
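The protocol also defines RemoveMailboxRoute, which the admin sends when a lease expires, but the proxy-side handling is not shown above. A self-contained sketch of a routing table with update, removal, and lookup follows. The type and method names are hypothetical, and storing routes by value is a design assumption that avoids sharing pointers across goroutines.

```go
package main

import (
	"fmt"
	"sync"
)

// route mirrors the routing fields of MailboxRoute (hypothetical type).
type route struct {
	MailboxID string
	ProxyID   string
	RunnerID  string
	Address   string
}

// routingTable is a concurrency-safe map from mailbox_id to route.
type routingTable struct {
	mu     sync.RWMutex
	routes map[string]route
}

// newRoutingTable initializes the map. Writing to a nil map panics in
// Go, so the table must be constructed before the first update.
func newRoutingTable() *routingTable {
	return &routingTable{routes: make(map[string]route)}
}

// Update inserts or overwrites the route for r.MailboxID.
func (t *routingTable) Update(r route) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.routes[r.MailboxID] = r
}

// Remove deletes a route and reports whether it was present.
func (t *routingTable) Remove(mailboxID string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	_, ok := t.routes[mailboxID]
	delete(t.routes, mailboxID)
	return ok
}

// Lookup returns the route for mailboxID or an error if none exists.
func (t *routingTable) Lookup(mailboxID string) (route, error) {
	t.mu.RLock()
	defer t.mu.RUnlock()
	r, ok := t.routes[mailboxID]
	if !ok {
		return route{}, fmt.Errorf("mailbox not found: %s", mailboxID)
	}
	return r, nil
}
```

Remove is idempotent, so a removal that the admin delivers twice does no harm.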

Query Forwarding

// Client queries mailbox through any proxy
func (p *Proxy) QueryMailbox(ctx context.Context, req *QueryMailboxRequest) (*QueryMailboxResponse, error) {
    // 1. Lookup mailbox in routing table
    route, err := p.RouteMailboxQuery(ctx, req.MailboxID)
    if err != nil {
        return nil, status.Errorf(codes.NotFound, "mailbox not found: %s", req.MailboxID)
    }

    // 2. Check if this proxy owns the mailbox
    if route.ProxyID == p.proxyID {
        // Local query - forward to local runner
        runner, err := p.getRunner(route.RunnerID)
        if err != nil {
            return nil, status.Errorf(codes.Internal, "runner not found: %s", route.RunnerID)
        }

        return runner.Query(ctx, req)
    }

    // 3. Remote query - forward to owning proxy
    log.Debug("Forwarding mailbox query",
        "mailbox_id", req.MailboxID,
        "from_proxy", p.proxyID,
        "to_proxy", route.ProxyID,
        "address", route.Address)

    proxyConn, err := p.getProxyConnection(route.Address)
    if err != nil {
        return nil, status.Errorf(codes.Unavailable, "failed to connect to proxy: %v", err)
    }

    proxyClient := NewProxyServiceClient(proxyConn)
    return proxyClient.QueryMailbox(ctx, req)
}

Alignment with RFC-048

Partition-Based Mailbox Assignment

Mailboxes are assigned to partitions using the same consistent hashing from RFC-048:

// Compute partition for mailbox (same as namespace partitioning)
func ComputeMailboxPartition(mailboxID string) int32 {
    hash := crc32.ChecksumIEEE([]byte(mailboxID))
    return int32(hash % 256) // 0-255
}

// Example (hash values are illustrative, not real CRC32 outputs):
// $admin@proxy-01 → hash=42 → partition=42
// audit-logs@proxy-02 → hash=155 → partition=155

Proxy Partition Ranges

Admin assigns partition ranges to proxies (RFC-048):

Proxy-01: partitions [0-63]   → owns mailboxes in this range
Proxy-02: partitions [64-127] → owns mailboxes in this range
Proxy-03: partitions [128-191] → owns mailboxes in this range
Proxy-04: partitions [192-255] → owns mailboxes in this range
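The range table above can be expressed as a lookup from partition to owning proxy. The sketch below assumes equal, contiguous ranges and proxy IDs of the form proxy-NN. Both are assumptions drawn from the example rather than rules stated by RFC-048, and ownerForPartition is a hypothetical helper.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

const partitionCount = 256

// computeMailboxPartition hashes the mailbox ID into [0, 255], as in
// ComputeMailboxPartition above.
func computeMailboxPartition(mailboxID string) int32 {
	return int32(crc32.ChecksumIEEE([]byte(mailboxID)) % partitionCount)
}

// ownerForPartition maps a partition to a proxy ID, assuming the 256
// partitions are split into numProxies equal contiguous ranges.
func ownerForPartition(partition int32, numProxies int) (string, error) {
	if partition < 0 || partition >= partitionCount {
		return "", fmt.Errorf("partition %d out of range [0, %d)", partition, partitionCount)
	}
	if numProxies <= 0 || partitionCount%numProxies != 0 {
		return "", fmt.Errorf("numProxies %d must evenly divide %d", numProxies, partitionCount)
	}
	width := partitionCount / numProxies
	return fmt.Sprintf("proxy-%02d", int(partition)/width+1), nil
}
```

With four proxies each range is 64 partitions wide, so partition 42 maps to proxy-01 and partition 155 maps to proxy-03, matching the ranges listed above.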

Namespace Configuration with Partition

namespaces:
  - name: $admin
    pattern: mailbox
    partition_strategy: consistent_hash  # From RFC-048
    assigned_partition: 42              # Computed from mailbox_id
    assigned_proxy: proxy-01            # Owns partition 42
    lease:
      ttl: 300s

Storage Schema (ADR-054)

Mailbox Leases Table

CREATE TABLE IF NOT EXISTS mailbox_leases (
    mailbox_id TEXT PRIMARY KEY,
    runner_id TEXT NOT NULL,
    proxy_id TEXT NOT NULL,
    partition INTEGER NOT NULL,
    lease_expires INTEGER NOT NULL,
    last_heartbeat INTEGER NOT NULL,
    address TEXT NOT NULL,
    status TEXT NOT NULL,
    created_at INTEGER NOT NULL,
    updated_at INTEGER NOT NULL
);

CREATE INDEX IF NOT EXISTS idx_lease_expires ON mailbox_leases(lease_expires);
CREATE INDEX IF NOT EXISTS idx_status ON mailbox_leases(status);
CREATE INDEX IF NOT EXISTS idx_proxy_partition ON mailbox_leases(proxy_id, partition);

Mailbox Routing Table

CREATE TABLE IF NOT EXISTS mailbox_routes (
    mailbox_id TEXT PRIMARY KEY,
    proxy_id TEXT NOT NULL,
    runner_id TEXT NOT NULL,
    address TEXT NOT NULL,
    partition INTEGER NOT NULL,
    created_at INTEGER NOT NULL,
    updated_at INTEGER NOT NULL,

    FOREIGN KEY (mailbox_id) REFERENCES mailbox_leases(mailbox_id)
);

CREATE INDEX IF NOT EXISTS idx_proxy_id ON mailbox_routes(proxy_id);
CREATE INDEX IF NOT EXISTS idx_partition ON mailbox_routes(partition);

Protobuf Protocol

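syntax = "proto3";

package prism.admin.v1;

// Required for the google.protobuf.Timestamp fields below
import "google/protobuf/timestamp.proto";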

service ControlPlane {
  // Mailbox lease management
  rpc AcquireLease(AcquireLeaseRequest) returns (AcquireLeaseResponse);
  rpc ReleaseLease(ReleaseLeaseRequest) returns (ReleaseLeaseResponse);
  rpc MailboxHeartbeat(MailboxHeartbeatRequest) returns (MailboxHeartbeatResponse);

  // Routing table updates (admin → proxy)
  rpc UpdateMailboxRoute(MailboxRouteUpdate) returns (MailboxRouteUpdateAck);
  rpc RemoveMailboxRoute(MailboxRouteRemoval) returns (MailboxRouteRemovalAck);
}

message AcquireLeaseRequest {
  string mailbox_id = 1;  // $admin@proxy-01
  string runner_id = 2;   // runner-01
  string proxy_id = 3;    // proxy-01
  int32 partition = 4;    // 0-255
  string address = 5;     // proxy-01.prism.local:8982
  int64 ttl_seconds = 6;  // 300
}

message AcquireLeaseResponse {
  bool success = 1;
  string message = 2;
  google.protobuf.Timestamp lease_expires = 3;
}

message ReleaseLeaseRequest {
  string mailbox_id = 1;
  string runner_id = 2;
}

message ReleaseLeaseResponse {
  bool released = 1;
  string message = 2;
}

message MailboxHeartbeatRequest {
  string mailbox_id = 1;
  string runner_id = 2;
  string proxy_id = 3;
  int64 events_stored = 4;
  int64 queries_served = 5;
  int64 storage_size_bytes = 6;
  string status = 7; // healthy, degraded
}

message MailboxHeartbeatResponse {
  bool lease_renewed = 1;
  string message = 2;
  google.protobuf.Timestamp new_expiration = 3;
}

message MailboxRouteUpdate {
  string mailbox_id = 1;
  string proxy_id = 2;
  string runner_id = 3;
  string address = 4;
  int32 partition = 5;
}

message MailboxRouteUpdateAck {
  bool success = 1;
}

message MailboxRouteRemoval {
  string mailbox_id = 1;
}

message MailboxRouteRemovalAck {
  bool success = 1;
}

Local-First Testing (ADR-004)

Single Admin Process

# Start local admin (no raft cluster needed for dev)
prism-admin --storage-path /tmp/admin-test.db --standalone

# Start proxy with pattern runner
prism-proxy --admin-endpoint localhost:8981 --proxy-id proxy-01

# Pattern runner acquires lease from local admin
# No etcd, no Redis, no external dependencies

Test Configuration

# dev/test config
admin:
  mode: standalone  # Single node, no raft replication
  storage_path: /tmp/admin-test.db

mailbox:
  lease:
    ttl: 60s       # Shorter TTL for faster tests
    heartbeat_interval: 10s
    grace_period: 10s

Migration Path

Phase 1: Basic Lease Management (Week 1-2)

Implement AcquireLease, ReleaseLease, MailboxHeartbeat RPCs in admin
Add mailbox_leases table to SQLite schema
Pattern runner lease acquisition on startup
Basic lease expiration background job

Phase 2: Routing Coordination (Week 3)

Add mailbox_routes table
Implement routing table distribution (admin → proxy)
Proxy routing table updates
Query forwarding logic

Phase 3: Production Hardening (Week 4)

Raft consensus for lease operations
Lease renewal backoff strategies
Graceful shutdown with lease release
Metrics and monitoring

Phase 4: Advanced Features (Future)

Lease rebalancing on proxy join/leave
Lease transfer between runners
Multi-region lease coordination (RFC-012)

Trade-Offs and Alternatives

Why Not etcd?

Trade-Off: Operational complexity vs purpose-built coordination

Decision: Operational simplicity > specialized tool

Rationale:

Prism targets single-tenant and small-scale deployments (10-100 proxies)
Admin plane already provides raft consensus
etcd adds 3-5 more nodes to manage, monitor, backup
Conflicts with local-first testing philosophy

Why Not Redis?

Trade-Off: Performance vs consistency guarantees

Decision: Strong consistency > millisecond latency

Rationale:

Lease conflicts cause split-brain (multiple runners own same mailbox)
Redis key expiration timing not precise enough
Sentinel/Cluster adds complexity comparable to etcd
Already using Redis as data backend, not control plane

Why Raft in Admin?

Trade-Off: Admin as critical path vs unified architecture

Decision: Accept admin dependency for architectural simplicity

Rationale:

Admin already critical for namespace coordination (RFC-047)
Raft consensus provides same guarantees as etcd
Zero new infrastructure dependencies
Unified monitoring and operations

Success Criteria

✅ Pattern runner acquires lease on startup (<100ms)
✅ Lease heartbeats every 60s with <50ms latency
✅ Expired leases cleaned up within one expiration-loop tick (30s) after the grace period (60s) elapses
✅ Routing table distributed to all proxies (<1s)
✅ Mailbox queries route to correct runner (99.9% success)
✅ Local-first testing works with single admin process
✅ Graceful shutdown releases lease within 5s
✅ Admin plane handles 100 concurrent pattern runners

Open Questions

Lease Transfer: Should leases be transferable between runners for rebalancing?
- Initial Answer: No, expire and re-acquire instead (simpler)
Lease Priority: Should some mailboxes have longer TTLs than others?
- Initial Answer: Yes, configure per-namespace lease policy
Multi-Region Leases: How to coordinate leases across regions?
- Defer: See RFC-012 for multi-cluster coordination
Lease Conflict Resolution: What if two runners claim same mailbox?
- Answer: Raft ensures only one lease granted (linearizable writes)
Pattern Runner Auto-Scaling: Should admin spawn runners automatically?
- Defer: Phase 4 feature, manual runner deployment initially

References

RFC-037: Mailbox Pattern - Searchable Event Store
RFC-047: Cross-Proxy Namespace Reservation with Lease Management
RFC-048: Cross-Proxy Partition Strategies and Request Forwarding
ADR-055: Proxy-Admin Control Plane Protocol
ADR-054: Prism-Admin SQLite Storage
ADR-004: Local-First Testing Strategy

Revision History

2025-10-27 (v1): Initial draft - Mailbox lifecycle, lease management, routing coordination with raft-based admin recommendation
