
RFC-049: Mailbox Lifecycle, Lease Management, and Routing Coordination

Abstract

This RFC specifies lifecycle management, lease coordination, and routing for the Mailbox Pattern (RFC-037). Pattern runners must establish both a routable identity and a TTL-based lease for each mailbox instance. The RFC compares three lease backend options (etcd, Redis, raft-based admin) and recommends raft-based admin storage because it simplifies deployment and matches the project's local-first, minimal-dependency philosophy (ADR-004). Pattern runners hold leases via the admin plane and report their availability through a session protocol; the admin plane then gives proxies the routing information they need to send mailbox queries to the correct backend.

Key Benefits:

  • Simplified Deployment: No external etcd or lease infrastructure required
  • Built-In Lease Management: Raft consensus in admin plane provides distributed coordination
  • Routable Identity: Each mailbox has a discoverable address for query routing
  • TTL-Based Lifecycle: Automatic cleanup of abandoned mailboxes
  • Cross-Proxy Coordination: Admin plane shares mailbox routing info with all proxies
  • Aligned with Project Philosophy: Local-first testing, minimal dependencies (ADR-004)

Motivation

Problem Statement

The Mailbox Pattern (RFC-037) stores events in searchable storage but lacks lifecycle management:

Problem 1: No Routable Identity

  • Clients don't know which backend instance holds their mailbox
  • Queries may target wrong proxy/pattern-runner
  • No service discovery for mailbox locations

Problem 2: No TTL Management

  • Mailboxes created but never queried remain indefinitely
  • No mechanism to expire inactive mailboxes
  • Resource waste from abandoned storage

Problem 3: No Lease Coordination

  • Pattern runners don't hold leases for their mailbox assignments
  • Multiple runners could claim same mailbox (split-brain)
  • No graceful shutdown mechanism

Problem 4: No Routing Infrastructure

  • Proxies don't know which proxy/runner owns a mailbox
  • Cannot forward mailbox queries to correct backend
  • Mailbox query requests fail or route incorrectly

Problem 5: Pattern Runner Backend Dependencies

  • Pattern runners need distributed coordination
  • Options: etcd, Redis, raft-based admin
  • Unclear which backend aligns with project goals

Goals

  1. Routable Identity: Each mailbox has a discoverable address (proxy_id + runner_id + partition_id)
  2. TTL-Based Leases: Pattern runners hold leases for mailbox assignments with automatic expiration
  3. Lease Backend Selection: Choose lease coordination backend aligned with project philosophy
  4. Session Protocol: Pattern runners communicate lifecycle (startup, heartbeat, shutdown) to admin
  5. Routing Coordination: Admin plane distributes mailbox routing table to all proxies
  6. Query Forwarding: Proxies forward mailbox queries to owning pattern runner
  7. Graceful Shutdown: Pattern runners release leases before terminating

Non-Goals

  • Data Migration: Moving mailbox data between storage backends
  • Multi-Region Leases: Cross-cluster lease coordination (see RFC-012)
  • Backend-Level TTL: Backend-specific expiration (SQLite VACUUM, etc.)
  • Lease Renewal Strategies: Advanced lease renewal algorithms (constant, backoff, etc.)

Lease Backend Comparison

Option 1: etcd for Lease Management

Architecture:

┌─────────────┐
│ Pattern │──lease──▶ etcd cluster
│ Runner │◀─watch─── (distributed KV + lease)
└─────────────┘

Pros:

  • ✅ Purpose-built for distributed coordination
  • ✅ Mature lease management with TTL
  • ✅ Watch API for lease changes
  • ✅ Strong consistency (raft consensus)
  • ✅ Well-documented, battle-tested

Cons:

  • ❌ Additional infrastructure dependency
  • ❌ Complex deployment (3-5 node cluster)
  • ❌ Conflicts with ADR-004 (local-first testing)
  • ❌ Operational burden (monitoring, backups, upgrades)
  • ❌ Network hop for every lease operation
  • ❌ Requires etcd client library in pattern runners

Deployment Complexity: HIGH

# Requires separate etcd cluster
etcd --name etcd-01 --initial-cluster etcd-01=http://...
etcd --name etcd-02 --initial-cluster etcd-02=http://...
etcd --name etcd-03 --initial-cluster etcd-03=http://...

Option 2: Redis for Lease Management

Architecture:

┌─────────────┐
│ Pattern │──SETEX──▶ Redis
│ Runner │◀─PUBSUB── (key expiration + pub/sub)
└─────────────┘

Pros:

  • ✅ Simple key expiration (SETEX, EXPIRE)
  • ✅ Pub/sub for lease events
  • ✅ High performance (in-memory)
  • ✅ Already used as backend in Prism
  • ✅ Lightweight single-node deployment

Cons:

  • ❌ Not designed for distributed coordination
  • ❌ No strong consistency guarantees
  • ❌ Key expiration timing not precise
  • ❌ Redis Sentinel/Cluster adds complexity
  • ❌ Split-brain risk without consensus
  • ❌ Additional dependency for lease management
  • ❌ Conflicts with minimal dependencies goal

Deployment Complexity: MEDIUM

# Requires Redis instance (single-node or Sentinel)
redis-server --port 6379
# OR with Sentinel for HA (3+ nodes)

Option 3: Raft-Based Admin Plane for Lease Management

Architecture:

┌─────────────┐
│ Pattern │──gRPC──▶ prism-admin (raft consensus)
│ Runner │◀─stream── (control plane protocol)
└─────────────┘ └─▶ SQLite + raft log

Pros:

  • ✅ Zero additional infrastructure (uses existing admin plane)
  • ✅ Raft consensus already in admin for namespace coordination (RFC-047)
  • ✅ Control plane protocol already defined (ADR-055)
  • ✅ Local-first testing (single admin process) (ADR-004)
  • ✅ SQLite storage for lease state (ADR-054)
  • ✅ Heartbeat mechanism already implemented for proxy registration
  • ✅ Aligned with project philosophy (minimal dependencies)
  • ✅ Built-in function of proxy/pattern/admin relationship
  • ✅ Unified monitoring (admin plane metrics cover leases)

Cons:

  • ⚠️ Admin plane becomes critical path for pattern runner startup
  • ⚠️ Lease coordination coupled to admin plane availability
  • ⚠️ Need to implement lease expiration logic in admin

Deployment Complexity: LOW

# No additional infrastructure (admin already runs)
prism-admin --storage-path /data/admin.db
# Pattern runners connect via existing control plane
prism-runner --admin-endpoint admin.prism.local:8981

Decision Matrix

Criteria              | etcd        | Redis      | Raft Admin        | Winner
Infrastructure        | 3-5 nodes   | 1-3 nodes  | 0 new nodes       | ✅ Raft Admin
Deployment Complexity | High        | Medium     | Low               | ✅ Raft Admin
Consistency           | Strong      | Weak       | Strong            | ✅ etcd/Raft Admin
Local-First Testing   | ❌ Cluster  | ⚠️ Redis   | ✅ Single process | ✅ Raft Admin
Integration Effort    | New client  | New client | Existing protocol | ✅ Raft Admin
Operational Burden    | High        | Medium     | Low               | ✅ Raft Admin
Project Alignment     | ❌ Complex  | ⚠️ OK      | ✅ Perfect        | ✅ Raft Admin
Performance           | ~10ms       | ~1ms       | ~5ms              | Redis (but not critical)

Recommendation: Raft-Based Admin Plane

Use raft consensus in prism-admin for mailbox lease management.

Rationale:

  1. Zero New Dependencies: Leverages existing admin plane infrastructure
  2. Simplified Deployment: No separate etcd/Redis cluster to manage
  3. Local-First Testing: Single admin process works for dev/test (ADR-004)
  4. Consistent Architecture: Same mechanism for namespace leases (RFC-047) and mailbox leases
  5. Strong Consistency: Raft provides same guarantees as etcd without operational overhead
  6. Built-In Function: Makes lease management intrinsic to proxy/pattern/admin relationship

Trade-Offs Accepted:

  • Admin plane becomes critical path (already true for namespace coordination)
  • Lease coordination performance slightly slower than Redis (~5ms vs ~1ms)
  • Need to implement lease expiration logic (one-time engineering cost)

Architecture

System Components

┌───────────────────────────────────────────────────────────────────┐
│ Client Applications │
└──────────────────────────┬────────────────────────────────────────┘
│ Query mailbox

┌──────────────────────────▼────────────────────────────────────────┐
│ Prism Proxy Fleet │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Proxy-01 │ │ Proxy-02 │ │ Proxy-03 │ │
│ │ (us-east-1a) │ │ (us-east-1b) │ │ (us-west-2a) │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │
│ │ Mailbox routing table (from admin) │ │
│ │ │ │ │
│ ┌──────▼──────────────────▼──────────────────▼──────┐ │
│ │ Pattern Runners (Mailbox) │ │
│ │ ┌───────────────┐ ┌───────────────┐ │ │
│ │ │ mailbox-01 │ │ mailbox-02 │ │ │
│ │ │ (partition 0) │ │ (partition 1) │ │ │
│ │ └───┬───────────┘ └───┬───────────┘ │ │
│ │ │ │ │ │
│ │ │ Lease heartbeat │ │ │
│ └──────┼──────────────────┼─────────────────────────┘ │
└─────────┼──────────────────┼─────────────────────────────────────┘
│ │
│ │ gRPC control plane
│ │
┌─────────▼──────────────────▼─────────────────────────────────────┐
│ Prism Admin Plane │
│ (Raft Consensus) │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Mailbox Lease Registry │ │
│ │ ┌─────────────────────────────────────────────────────┐ │ │
│ │ │ mailbox_id | runner_id | lease_expires │ │ │
│ │ │ $admin@proxy-01 | runner-01 | 2025-10-27T10:35 │ │ │
│ │ │ audit-logs@proxy-02| runner-02 | 2025-10-27T10:36 │ │ │
│ │ └─────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Mailbox Routing Table (distributed to proxies) │ │
│ │ ┌─────────────────────────────────────────────────────┐ │ │
│ │ │ mailbox_id | proxy_id | runner_id | addr │ │ │
│ │ │ $admin@proxy-01 | proxy-01 | runner-01 | :8982 │ │ │
│ │ │ audit-logs@proxy-02| proxy-02 | runner-02 | :8983 │ │ │
│ │ └─────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ SQLite Storage (ADR-054) + Raft Log │
└────────────────────────────────────────────────────────────────────┘

Routable Identity Format

Each mailbox has a unique routable identity:

mailbox_id = "{namespace}@{proxy_id}"

Examples:
$admin@proxy-01 → Admin mailbox on proxy-01
audit-logs@proxy-02 → Audit mailbox on proxy-02
user-events@proxy-01 → User events mailbox on proxy-01

Components:

  • namespace: Mailbox namespace (e.g., $admin, audit-logs)
  • proxy_id: Proxy instance owning the mailbox (e.g., proxy-01)
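
The identity format above can be parsed and rebuilt with a small helper. The following is a minimal sketch, not part of the RFC: `MailboxIdentity` and `ParseMailboxID` are hypothetical names, and splitting at the last `@` assumes that proxy IDs never contain `@`.

```go
package main

import (
	"fmt"
	"strings"
)

// MailboxIdentity is a hypothetical helper type holding the two
// components of a routable mailbox identity.
type MailboxIdentity struct {
	Namespace string
	ProxyID   string
}

// ParseMailboxID splits "{namespace}@{proxy_id}" at the last '@'.
// Both components must be non-empty.
func ParseMailboxID(id string) (MailboxIdentity, error) {
	i := strings.LastIndex(id, "@")
	if i <= 0 || i == len(id)-1 {
		return MailboxIdentity{}, fmt.Errorf("invalid mailbox_id %q: want namespace@proxy_id", id)
	}
	return MailboxIdentity{Namespace: id[:i], ProxyID: id[i+1:]}, nil
}

// String renders the identity back into its routable form.
func (m MailboxIdentity) String() string {
	return m.Namespace + "@" + m.ProxyID
}
```

Parsing `$admin@proxy-01` with this helper yields namespace `$admin` and proxy `proxy-01`; inputs with a missing component are rejected.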

Routing Table Entry:

type MailboxRoute struct {
MailboxID string // $admin@proxy-01
ProxyID string // proxy-01
RunnerID string // runner-01
Address string // proxy-01.prism.local:8982
Partition int32 // 0-255 (from RFC-048)
}

Lease Structure

type MailboxLease struct {
MailboxID string // $admin@proxy-01
RunnerID string // runner-01
ProxyID string // proxy-01
Partition int32 // Partition ID (0-255)
LeaseExpires time.Time // TTL expiration
LastHeartbeat time.Time // Last heartbeat from runner
Address string // runner gRPC address
Status string // active, expiring, expired
}

Lease TTL Configuration:

namespaces:
  - name: $admin
    pattern: mailbox
    lease:
      ttl: 300s # 5 minutes
      heartbeat_interval: 60s # 1 minute
      grace_period: 60s # 1 minute after TTL

Pattern Runner Lease Protocol

Startup: Acquire Lease

// Pattern runner connects to admin on startup
func (r *MailboxRunner) Start(ctx context.Context) error {
	// 1. Connect to admin control plane.
	// grpc-go requires explicit transport credentials; insecure credentials
	// (google.golang.org/grpc/credentials/insecure) are shown for local testing only.
	conn, err := grpc.Dial(r.adminEndpoint,
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return fmt.Errorf("failed to connect to admin: %w", err)
	}
	// Keep the client on the runner; heartbeat and shutdown reuse it.
	r.adminClient = NewControlPlaneClient(conn)

	// 2. Acquire lease for mailbox
	req := &AcquireLeaseRequest{
		MailboxID: r.mailboxID, // $admin@proxy-01
		RunnerID:  r.runnerID,  // runner-01
		ProxyID:   r.proxyID,   // proxy-01
		Partition: r.partition, // 0
		Address:   r.address,   // proxy-01.prism.local:8982
		TTL:       300,         // 5 minutes
	}

	resp, err := r.adminClient.AcquireLease(ctx, req)
	if err != nil {
		return fmt.Errorf("failed to acquire lease: %w", err)
	}

	if !resp.Success {
		return fmt.Errorf("lease acquisition denied: %s", resp.Message)
	}

	// 3. Start heartbeat goroutine
	go r.heartbeatLoop(ctx)

	// 4. Start mailbox consumer
	go r.consumeMessages(ctx)

	log.Info("Mailbox runner started",
		"mailbox_id", r.mailboxID,
		"lease_expires", resp.LeaseExpires)

	return nil
}

Runtime: Heartbeat

func (r *MailboxRunner) heartbeatLoop(ctx context.Context) {
ticker := time.NewTicker(60 * time.Second) // heartbeat_interval
defer ticker.Stop()

for {
select {
case <-ticker.C:
err := r.sendHeartbeat(ctx)
if err != nil {
log.Error("Heartbeat failed", "error", err)
// Retry with exponential backoff
}

case <-ctx.Done():
log.Info("Heartbeat loop stopping")
return
}
}
}

func (r *MailboxRunner) sendHeartbeat(ctx context.Context) error {
	// Request type and RPC name follow the protobuf service definition.
	req := &MailboxHeartbeatRequest{
		MailboxID:        r.mailboxID,
		RunnerID:         r.runnerID,
		ProxyID:          r.proxyID,
		EventsStored:     r.stats.TotalEvents,
		QueriesServed:    r.stats.TotalQueries,
		StorageSizeBytes: r.stats.StorageSize,
		Status:           "healthy",
	}

	resp, err := r.adminClient.MailboxHeartbeat(ctx, req)
	if err != nil {
		return fmt.Errorf("heartbeat RPC failed: %w", err)
	}

	if !resp.LeaseRenewed {
		log.Warn("Lease not renewed, shutting down",
			"reason", resp.Message)
		// Stop here: the lease is gone, so there is no new expiration to report.
		return r.Shutdown(ctx)
	}

	log.Debug("Heartbeat sent",
		"lease_expires", resp.NewExpiration)

	return nil
}

Shutdown: Release Lease

func (r *MailboxRunner) Shutdown(ctx context.Context) error {
	log.Info("Shutting down mailbox runner", "mailbox_id", r.mailboxID)

	// 1. Stop accepting new messages
	r.consumer.Stop()

	// 2. Flush pending writes
	r.storage.Flush()

	// 3. Release lease
	req := &ReleaseLeaseRequest{
		MailboxID: r.mailboxID,
		RunnerID:  r.runnerID,
	}

	resp, err := r.adminClient.ReleaseLease(ctx, req)
	if err != nil {
		// resp is nil when the RPC fails, so do not read it.
		// Shutdown continues; the admin expires the lease after TTL + grace period.
		log.Error("Failed to release lease", "error", err)
		return nil
	}

	log.Info("Lease released",
		"mailbox_id", r.mailboxID,
		"released", resp.Released)

	return nil
}

Admin Plane Lease Management

Lease Acquisition

func (a *AdminPlane) AcquireLease(ctx context.Context, req *AcquireLeaseRequest) (*AcquireLeaseResponse, error) {
// 1. Check if lease already exists
existingLease, err := a.storage.GetLease(req.MailboxID)
if err != nil && err != ErrNotFound {
return nil, fmt.Errorf("failed to check lease: %w", err)
}

// 2. If lease exists and not expired, deny
if existingLease != nil && time.Now().Before(existingLease.LeaseExpires) {
if existingLease.RunnerID != req.RunnerID {
return &AcquireLeaseResponse{
Success: false,
Message: fmt.Sprintf("mailbox already leased by %s", existingLease.RunnerID),
}, nil
}
// Same runner re-acquiring (reconnect case)
}

// 3. Create new lease
lease := &MailboxLease{
MailboxID: req.MailboxID,
RunnerID: req.RunnerID,
ProxyID: req.ProxyID,
Partition: req.Partition,
LeaseExpires: time.Now().Add(time.Duration(req.TTL) * time.Second),
LastHeartbeat: time.Now(),
Address: req.Address,
Status: "active",
}

// 4. Store lease via raft consensus
err = a.raftApply(ctx, &LeaseOperation{
Type: "acquire",
Lease: lease,
})
if err != nil {
return nil, fmt.Errorf("raft apply failed: %w", err)
}

// 5. Update routing table and distribute to proxies
a.updateRoutingTable(lease)
a.broadcastRoutingUpdate(lease)

log.Info("Lease acquired",
"mailbox_id", lease.MailboxID,
"runner_id", lease.RunnerID,
"expires", lease.LeaseExpires)

return &AcquireLeaseResponse{
Success: true,
LeaseExpires: lease.LeaseExpires,
}, nil
}

Heartbeat Processing

// Request and response types follow the protobuf service definition.
func (a *AdminPlane) ProcessHeartbeat(ctx context.Context, req *MailboxHeartbeatRequest) (*MailboxHeartbeatResponse, error) {
	// 1. Lookup lease
	lease, err := a.storage.GetLease(req.MailboxID)
	if err != nil {
		return &MailboxHeartbeatResponse{
			LeaseRenewed: false,
			Message:      "lease not found",
		}, nil
	}

	// 2. Verify runner owns lease
	if lease.RunnerID != req.RunnerID {
		return &MailboxHeartbeatResponse{
			LeaseRenewed: false,
			Message:      fmt.Sprintf("lease owned by %s", lease.RunnerID),
		}, nil
	}

	// 3. Check if lease expired (beyond grace period)
	gracePeriod := 60 * time.Second // grace_period from the lease configuration
	if time.Now().After(lease.LeaseExpires.Add(gracePeriod)) {
		return &MailboxHeartbeatResponse{
			LeaseRenewed: false,
			Message:      "lease expired beyond grace period",
		}, nil
	}

	// 4. Renew lease
	lease.LeaseExpires = time.Now().Add(300 * time.Second) // ttl from the lease configuration
	lease.LastHeartbeat = time.Now()
	lease.Status = "active"

	// 5. Apply via raft
	err = a.raftApply(ctx, &LeaseOperation{
		Type:  "renew",
		Lease: lease,
	})
	if err != nil {
		return nil, fmt.Errorf("raft apply failed: %w", err)
	}

	log.Debug("Lease renewed",
		"mailbox_id", lease.MailboxID,
		"new_expiration", lease.LeaseExpires)

	return &MailboxHeartbeatResponse{
		LeaseRenewed:  true,
		NewExpiration: lease.LeaseExpires,
	}, nil
}

Lease Expiration Background Job

func (a *AdminPlane) leaseExpirationLoop(ctx context.Context) {
ticker := time.NewTicker(30 * time.Second)
defer ticker.Stop()

for {
select {
case <-ticker.C:
a.expireStaleLeases(ctx)

case <-ctx.Done():
return
}
}
}

func (a *AdminPlane) expireStaleLeases(ctx context.Context) {
	leases, err := a.storage.GetAllLeases()
	if err != nil {
		log.Error("Failed to fetch leases", "error", err)
		return
	}

	now := time.Now()
	gracePeriod := 60 * time.Second

	for _, lease := range leases {
		// Skip leases already marked expired; otherwise every scan would
		// re-apply the expiry and re-broadcast the removal.
		if lease.Status == "expired" {
			continue
		}

		if now.After(lease.LeaseExpires.Add(gracePeriod)) {
			log.Warn("Expiring stale lease",
				"mailbox_id", lease.MailboxID,
				"runner_id", lease.RunnerID,
				"last_heartbeat", lease.LastHeartbeat)

			// Mark as expired via raft
			lease.Status = "expired"
			err := a.raftApply(ctx, &LeaseOperation{
				Type:  "expire",
				Lease: lease,
			})
			if err != nil {
				log.Error("Failed to expire lease", "error", err)
				continue
			}

			// Remove from routing table
			a.removeFromRoutingTable(lease.MailboxID)
			a.broadcastRoutingRemoval(lease.MailboxID)
		}
	}
}

Routing Coordination

Routing Table Distribution

// Admin broadcasts routing updates to all proxies
func (a *AdminPlane) broadcastRoutingUpdate(lease *MailboxLease) {
route := &MailboxRoute{
MailboxID: lease.MailboxID,
ProxyID: lease.ProxyID,
RunnerID: lease.RunnerID,
Address: lease.Address,
Partition: lease.Partition,
}

// Send to all registered proxies
for _, proxy := range a.proxies.GetAll() {
go func(p *ProxyRegistration) {
err := p.client.UpdateMailboxRoute(context.Background(), route)
if err != nil {
log.Error("Failed to update routing",
"proxy_id", p.ProxyID,
"mailbox_id", route.MailboxID,
"error", err)
}
}(proxy)
}
}

Proxy Routing Table

type ProxyRoutingTable struct {
mu sync.RWMutex
routes map[string]*MailboxRoute // mailbox_id -> route
}

func (p *Proxy) UpdateMailboxRoute(ctx context.Context, route *MailboxRoute) error {
p.routingTable.mu.Lock()
defer p.routingTable.mu.Unlock()

p.routingTable.routes[route.MailboxID] = route

log.Info("Routing table updated",
"mailbox_id", route.MailboxID,
"proxy_id", route.ProxyID,
"address", route.Address)

return nil
}

func (p *Proxy) RouteMailboxQuery(ctx context.Context, mailboxID string) (*MailboxRoute, error) {
p.routingTable.mu.RLock()
defer p.routingTable.mu.RUnlock()

route, exists := p.routingTable.routes[mailboxID]
if !exists {
return nil, fmt.Errorf("mailbox not found: %s", mailboxID)
}

return route, nil
}

Query Forwarding

// Client queries mailbox through any proxy
func (p *Proxy) QueryMailbox(ctx context.Context, req *QueryMailboxRequest) (*QueryMailboxResponse, error) {
// 1. Lookup mailbox in routing table
route, err := p.RouteMailboxQuery(ctx, req.MailboxID)
if err != nil {
return nil, status.Errorf(codes.NotFound, "mailbox not found: %s", req.MailboxID)
}

// 2. Check if this proxy owns the mailbox
if route.ProxyID == p.proxyID {
// Local query - forward to local runner
runner, err := p.getRunner(route.RunnerID)
if err != nil {
return nil, status.Errorf(codes.Internal, "runner not found: %s", route.RunnerID)
}

return runner.Query(ctx, req)
}

// 3. Remote query - forward to owning proxy
log.Debug("Forwarding mailbox query",
"mailbox_id", req.MailboxID,
"from_proxy", p.proxyID,
"to_proxy", route.ProxyID,
"address", route.Address)

proxyConn, err := p.getProxyConnection(route.Address)
if err != nil {
return nil, status.Errorf(codes.Unavailable, "failed to connect to proxy: %v", err)
}

proxyClient := NewProxyServiceClient(proxyConn)
return proxyClient.QueryMailbox(ctx, req)
}

Alignment with RFC-048

Partition-Based Mailbox Assignment

Mailboxes are assigned to partitions using the same consistent hashing from RFC-048:

// Compute partition for mailbox (same as namespace partitioning)
func ComputeMailboxPartition(mailboxID string) int32 {
hash := crc32.ChecksumIEEE([]byte(mailboxID))
return int32(hash % 256) // 0-255
}

// Example (illustrative values, not actual CRC32 outputs):
// $admin@proxy-01 → hash=42 → partition=42
// audit-logs@proxy-02 → hash=155 → partition=155

Proxy Partition Ranges

Admin assigns partition ranges to proxies (RFC-048):

Proxy-01: partitions [0-63]   → owns mailboxes in this range
Proxy-02: partitions [64-127] → owns mailboxes in this range
Proxy-03: partitions [128-191] → owns mailboxes in this range
Proxy-04: partitions [192-255] → owns mailboxes in this range
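
The range table above follows from splitting the 256 partitions into equal contiguous blocks. The sketch below shows the mapping from a mailbox ID to a partition and from a partition to a proxy index. `proxyIndexForPartition` is a hypothetical helper, and assigning any remainder partitions to the last proxy is an assumption; RFC-048 remains the source for the real assignment rules.

```go
package main

import "hash/crc32"

const numPartitions = 256

// computeMailboxPartition hashes the mailbox ID into 0-255 using the
// CRC32-mod-256 scheme shown earlier in this section.
func computeMailboxPartition(mailboxID string) int32 {
	return int32(crc32.ChecksumIEEE([]byte(mailboxID)) % numPartitions)
}

// proxyIndexForPartition maps a partition to the zero-based index of
// the owning proxy when partitions are split into equal contiguous
// ranges. It returns -1 for invalid input.
func proxyIndexForPartition(partition int32, numProxies int) int {
	if numProxies <= 0 || numProxies > numPartitions {
		return -1
	}
	if partition < 0 || partition >= numPartitions {
		return -1
	}
	rangeSize := numPartitions / numProxies
	idx := int(partition) / rangeSize
	if idx >= numProxies {
		// Remainder partitions go to the last proxy (assumption).
		idx = numProxies - 1
	}
	return idx
}
```

With four proxies the range size is 64, so partition 42 maps to index 0 (Proxy-01) and partition 155 maps to index 2 (Proxy-03).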

Namespace Configuration with Partition

namespaces:
  - name: $admin
    pattern: mailbox
    partition_strategy: consistent_hash # From RFC-048
    assigned_partition: 42 # Computed from mailbox_id
    assigned_proxy: proxy-01 # Owns partition 42
    lease:
      ttl: 300s

Storage Schema (ADR-054)

Mailbox Leases Table

CREATE TABLE IF NOT EXISTS mailbox_leases (
mailbox_id TEXT PRIMARY KEY,
runner_id TEXT NOT NULL,
proxy_id TEXT NOT NULL,
partition INTEGER NOT NULL,
lease_expires INTEGER NOT NULL,
last_heartbeat INTEGER NOT NULL,
address TEXT NOT NULL,
status TEXT NOT NULL,
created_at INTEGER NOT NULL,
updated_at INTEGER NOT NULL
);

CREATE INDEX IF NOT EXISTS idx_lease_expires ON mailbox_leases(lease_expires);
CREATE INDEX IF NOT EXISTS idx_status ON mailbox_leases(status);
CREATE INDEX IF NOT EXISTS idx_proxy_partition ON mailbox_leases(proxy_id, partition);

Mailbox Routing Table

CREATE TABLE IF NOT EXISTS mailbox_routes (
mailbox_id TEXT PRIMARY KEY,
proxy_id TEXT NOT NULL,
runner_id TEXT NOT NULL,
address TEXT NOT NULL,
partition INTEGER NOT NULL,
created_at INTEGER NOT NULL,
updated_at INTEGER NOT NULL,

FOREIGN KEY (mailbox_id) REFERENCES mailbox_leases(mailbox_id)
);

CREATE INDEX IF NOT EXISTS idx_proxy_id ON mailbox_routes(proxy_id);
CREATE INDEX IF NOT EXISTS idx_partition ON mailbox_routes(partition);

Protobuf Protocol

syntax = "proto3";

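package prism.admin.v1;

// Required for the google.protobuf.Timestamp fields below.
import "google/protobuf/timestamp.proto";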

service ControlPlane {
// Mailbox lease management
rpc AcquireLease(AcquireLeaseRequest) returns (AcquireLeaseResponse);
rpc ReleaseLease(ReleaseLeaseRequest) returns (ReleaseLeaseResponse);
rpc MailboxHeartbeat(MailboxHeartbeatRequest) returns (MailboxHeartbeatResponse);

// Routing table updates (admin → proxy)
rpc UpdateMailboxRoute(MailboxRouteUpdate) returns (MailboxRouteUpdateAck);
rpc RemoveMailboxRoute(MailboxRouteRemoval) returns (MailboxRouteRemovalAck);
}

message AcquireLeaseRequest {
string mailbox_id = 1; // $admin@proxy-01
string runner_id = 2; // runner-01
string proxy_id = 3; // proxy-01
int32 partition = 4; // 0-255
string address = 5; // proxy-01.prism.local:8982
int64 ttl_seconds = 6; // 300
}

message AcquireLeaseResponse {
bool success = 1;
string message = 2;
google.protobuf.Timestamp lease_expires = 3;
}

message ReleaseLeaseRequest {
string mailbox_id = 1;
string runner_id = 2;
}

message ReleaseLeaseResponse {
bool released = 1;
string message = 2;
}

message MailboxHeartbeatRequest {
string mailbox_id = 1;
string runner_id = 2;
string proxy_id = 3;
int64 events_stored = 4;
int64 queries_served = 5;
int64 storage_size_bytes = 6;
string status = 7; // healthy, degraded
}

message MailboxHeartbeatResponse {
bool lease_renewed = 1;
string message = 2;
google.protobuf.Timestamp new_expiration = 3;
}

message MailboxRouteUpdate {
string mailbox_id = 1;
string proxy_id = 2;
string runner_id = 3;
string address = 4;
int32 partition = 5;
}

message MailboxRouteUpdateAck {
bool success = 1;
}

message MailboxRouteRemoval {
string mailbox_id = 1;
}

message MailboxRouteRemovalAck {
bool success = 1;
}

Local-First Testing (ADR-004)

Single Admin Process

# Start local admin (no raft cluster needed for dev)
prism-admin --storage-path /tmp/admin-test.db --standalone

# Start proxy with pattern runner
prism-proxy --admin-endpoint localhost:8981 --proxy-id proxy-01

# Pattern runner acquires lease from local admin
# No etcd, no Redis, no external dependencies

Test Configuration

# dev/test config
admin:
  mode: standalone # Single node, no raft replication
  storage_path: /tmp/admin-test.db

mailbox:
  lease:
    ttl: 60s # Shorter TTL for faster tests
    heartbeat_interval: 10s
    grace_period: 10s

Migration Path

Phase 1: Basic Lease Management (Week 1-2)

  • Implement AcquireLease, ReleaseLease, MailboxHeartbeat RPCs in admin
  • Add mailbox_leases table to SQLite schema
  • Pattern runner lease acquisition on startup
  • Basic lease expiration background job

Phase 2: Routing Coordination (Week 3)

  • Add mailbox_routes table
  • Implement routing table distribution (admin → proxy)
  • Proxy routing table updates
  • Query forwarding logic

Phase 3: Production Hardening (Week 4)

  • Raft consensus for lease operations
  • Lease renewal backoff strategies
  • Graceful shutdown with lease release
  • Metrics and monitoring

Phase 4: Advanced Features (Future)

  • Lease rebalancing on proxy join/leave
  • Lease transfer between runners
  • Multi-region lease coordination (RFC-012)

Trade-Offs and Alternatives

Why Not etcd?

Trade-Off: Operational complexity vs purpose-built coordination

Decision: Operational simplicity > specialized tool

Rationale:

  • Prism targets single-tenant and small-scale deployments (10-100 proxies)
  • Admin plane already provides raft consensus
  • etcd adds 3-5 more nodes to manage, monitor, backup
  • Conflicts with local-first testing philosophy

Why Not Redis?

Trade-Off: Performance vs consistency guarantees

Decision: Strong consistency > millisecond latency

Rationale:

  • Lease conflicts cause split-brain (multiple runners own same mailbox)
  • Redis key expiration timing not precise enough
  • Sentinel/Cluster adds complexity comparable to etcd
  • Already using Redis as data backend, not control plane

Why Raft in Admin?

Trade-Off: Admin as critical path vs unified architecture

Decision: Accept admin dependency for architectural simplicity

Rationale:

  • Admin already critical for namespace coordination (RFC-047)
  • Raft consensus provides same guarantees as etcd
  • Zero new infrastructure dependencies
  • Unified monitoring and operations

Success Criteria

  1. ✅ Pattern runner acquires lease on startup (<100ms)
  2. ✅ Lease heartbeats every 60s with <50ms latency
  3. ✅ Expired leases cleaned up within the grace period (60s) plus one expiration scan interval (30s)
  4. ✅ Routing table distributed to all proxies (<1s)
  5. ✅ Mailbox queries route to correct runner (99.9% success)
  6. ✅ Local-first testing works with single admin process
  7. ✅ Graceful shutdown releases lease within 5s
  8. ✅ Admin plane handles 100 concurrent pattern runners

Open Questions

  1. Lease Transfer: Should leases be transferable between runners for rebalancing?

    • Initial Answer: No, expire and re-acquire instead (simpler)
  2. Lease Priority: Should some mailboxes have longer TTLs than others?

    • Initial Answer: Yes, configure per-namespace lease policy
  3. Multi-Region Leases: How to coordinate leases across regions?

    • Defer: See RFC-012 for multi-cluster coordination
  4. Lease Conflict Resolution: What if two runners claim same mailbox?

    • Answer: Raft ensures only one lease granted (linearizable writes)
  5. Pattern Runner Auto-Scaling: Should admin spawn runners automatically?

    • Defer: Phase 4 feature, manual runner deployment initially

References

  • RFC-037: Mailbox Pattern - Searchable Event Store
  • RFC-047: Cross-Proxy Namespace Reservation with Lease Management
  • RFC-048: Cross-Proxy Partition Strategies and Request Forwarding
  • ADR-055: Proxy-Admin Control Plane Protocol
  • ADR-054: Prism-Admin SQLite Storage
  • ADR-004: Local-First Testing Strategy

Revision History

  • 2025-10-27 (v1): Initial draft - Mailbox lifecycle, lease management, routing coordination with raft-based admin recommendation