RFC-035: Pattern Process Launcher with Bulkhead Isolation
Summary
This RFC proposes a lightweight process launcher for pattern executables that can run headless and answer launch requests from the Prism proxy. The launcher will be an optional component (alternatives include Kubernetes deployments, systemd, or other orchestrators) that provides lifecycle management using the bulkhead isolation pattern (via pkg/isolation) and robust process management (via pkg/procmgr). The launcher will support three isolation levels: None, Namespace, and Session, ensuring fault isolation and proper resource boundaries.
Motivation
Current Situation
Prism patterns (Consumer, Producer, Multicast Registry, Claim Check, etc.) currently run as standalone executables that must be launched manually or via external orchestration. The Prism proxy needs a way to:
- Launch pattern processes on demand: When a client requests a pattern operation, the proxy must ensure the corresponding pattern process is running
- Manage process lifecycle: Start, monitor health, restart on failure, graceful shutdown
- Isolate failures: Prevent one namespace/session's failures from affecting others (bulkhead pattern)
- List available patterns: Proxy needs to know which patterns are available and their status
- Support multiple deployment models: Local development (direct exec), containerized (Podman/Docker), orchestrated (Kubernetes)
Why an Optional Launcher?
The pattern process launcher is optional because different deployment models have different orchestration needs:
| Deployment Model | Orchestration Method | When to Use Launcher |
|---|---|---|
| Local Development | Direct exec, Launcher | ✅ Use Launcher - simplest local workflow |
| Docker Compose | Compose services | ❌ Compose handles lifecycle |
| Kubernetes | Deployments, StatefulSets | ❌ K8s handles lifecycle |
| Bare Metal / VMs | systemd, Launcher | ✅ Use Launcher - lightweight alternative to systemd |
| Serverless (Lambda) | Function invocation | ❌ Platform handles lifecycle |
Key insight: The launcher provides proxy-driven lifecycle control (proxy decides when to start/stop patterns) rather than external orchestration (K8s/systemd decides independently).
Bulkhead Isolation Pattern
The bulkhead pattern (from ship design: compartmentalized hull sections prevent total flooding) isolates processes into separate "compartments" to prevent cascading failures:
┌─────────────────────────────────────────────────────┐
│  Pattern Process Launcher (Headless Daemon)         │
│                                                     │
│  ┌────────────────┐  ┌────────────────┐             │
│  │ Isolation      │  │ Process        │             │
│  │ Manager        │←→│ Manager        │             │
│  │ (Bulkhead)     │  │ (procmgr)      │             │
│  └────────────────┘  └────────────────┘             │
│          ↓                   ↓                      │
│  ┌──────────────────────────────────────────────┐   │
│  │ Process Pool (Isolated by Level)             │   │
│  │                                              │   │
│  │ Isolation Level: Namespace                   │   │
│  │ ┌─────────────┐  ┌─────────────┐             │   │
│  │ │ ns:tenant-a │  │ ns:tenant-b │             │   │
│  │ │ Consumer    │  │ Consumer    │             │   │
│  │ │ Process     │  │ Process     │             │   │
│  │ └─────────────┘  └─────────────┘             │   │
│  │                                              │   │
│  │ Isolation Level: Session                     │   │
│  │ ┌──────────────┐  ┌──────────────┐           │   │
│  │ │session:user-1│  │session:user-2│           │   │
│  │ │ Producer     │  │ Producer     │           │   │
│  │ │ Process      │  │ Process      │           │   │
│  │ └──────────────┘  └──────────────┘           │   │
│  │                                              │   │
│  │ Isolation Level: None                        │   │
│  │ ┌─────────────┐                              │   │
│  │ │ shared      │                              │   │
│  │ │ Registry    │                              │   │
│  │ │ Process     │                              │   │
│  │ └─────────────┘                              │   │
│  └──────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────┘
           ↑                        ↑
           │ gRPC Launch API        │ Health/Status
           │                        │
  ┌────────┴────────┐      ┌────────┴────────┐
  │ Prism Proxy     │      │ Monitoring      │
  │ (Rust)          │      │ (Prometheus)    │
  └─────────────────┘      └─────────────────┘
Design
Architecture Components
1. Pattern Process Launcher (cmd/pattern-launcher)
Headless daemon that:
- Listens on gRPC for launch requests from proxy
- Uses pkg/isolation.IsolationManager to manage process pools
- Uses pkg/procmgr.ProcessManager for robust process lifecycle
- Discovers available patterns via filesystem (executable manifests)
- Exports Prometheus metrics and health endpoints
type PatternLauncher struct {
	// Configuration
	config         *LauncherConfig
	isolationLevel isolation.IsolationLevel

	// Management
	isolationMgr *isolation.IsolationManager

	// Pattern discovery
	patterns   map[string]*PatternManifest
	patternsMu sync.RWMutex

	// gRPC server
	grpcServer *grpc.Server
}
2. Pattern Manifest (patterns/<name>/manifest.yaml)
Declarative configuration for each pattern:
name: consumer
version: 1.0.0
executable: ./patterns/consumer/consumer
isolation_level: namespace  # none | namespace | session
healthcheck:
  port: 9090
  path: /health
  interval: 30s
resources:
  cpu_limit: 1.0
  memory_limit: 512Mi
backend_slots:
  - name: storage
    type: postgres
    required: true
  - name: messaging
    type: kafka
    required: true
environment:
  LOG_LEVEL: info
  METRICS_PORT: "9091"
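A manifest like this would typically be validated before the pattern is registered. The following is a minimal sketch of that check, assuming a hypothetical `PatternManifest` struct that carries only a few of the fields above and a hypothetical `Validate` method; YAML decoding is left out so the example needs only the standard library.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// PatternManifest mirrors a subset of manifest.yaml.
// The struct and field names are illustrative assumptions, not the RFC's types.
type PatternManifest struct {
	Name           string
	Version        string
	Executable     string
	IsolationLevel string // "none", "namespace", or "session"
	HealthInterval string // healthcheck.interval, e.g. "30s"
}

// Validate collects every problem it finds instead of stopping at the first.
func (m PatternManifest) Validate() error {
	var errs []error
	if m.Name == "" {
		errs = append(errs, errors.New("name is required"))
	}
	if m.Executable == "" {
		errs = append(errs, errors.New("executable is required"))
	}
	switch m.IsolationLevel {
	case "none", "namespace", "session":
	default:
		errs = append(errs, fmt.Errorf("isolation_level %q is not one of none, namespace, session", m.IsolationLevel))
	}
	if d, err := time.ParseDuration(m.HealthInterval); err != nil {
		errs = append(errs, fmt.Errorf("healthcheck.interval: %w", err))
	} else if d <= 0 {
		errs = append(errs, errors.New("healthcheck.interval must be positive"))
	}
	return errors.Join(errs...) // nil when errs is empty
}

func main() {
	m := PatternManifest{
		Name:           "consumer",
		Version:        "1.0.0",
		Executable:     "./patterns/consumer/consumer",
		IsolationLevel: "namespace",
		HealthInterval: "30s",
	}
	fmt.Println("valid manifest error:", m.Validate())

	m.IsolationLevel = "tenant" // not a supported level
	fmt.Println("invalid manifest error:", m.Validate())
}
```

Reporting all errors at once (via `errors.Join`, available since Go 1.20) keeps the feedback loop short when a manifest has several mistakes.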
3. Launch gRPC API
service PatternLauncher {
  // Launch or get existing pattern process
  rpc LaunchPattern(LaunchRequest) returns (LaunchResponse);

  // List all running pattern processes
  rpc ListPatterns(ListPatternsRequest) returns (ListPatternsResponse);

  // Terminate a pattern process
  rpc TerminatePattern(TerminateRequest) returns (TerminateResponse);

  // Health check
  rpc Health(HealthRequest) returns (HealthResponse);
}

message LaunchRequest {
  string pattern_name = 1;         // e.g., "consumer", "producer"
  IsolationLevel isolation = 2;    // NONE, NAMESPACE, SESSION
  string namespace = 3;            // Tenant namespace (for NAMESPACE isolation)
  string session_id = 4;           // Session ID (for SESSION isolation)
  map<string, string> config = 5;  // Pattern-specific config
}

message LaunchResponse {
  string process_id = 1;   // Unique process ID
  ProcessState state = 2;  // STARTING, RUNNING, TERMINATING, etc.
  string address = 3;      // gRPC address to connect to pattern
  bool healthy = 4;
}

message ListPatternsResponse {
  repeated PatternInfo patterns = 1;
}

message PatternInfo {
  string pattern_name = 1;
  string process_id = 2;
  ProcessState state = 3;
  string address = 4;
  bool healthy = 5;
  int64 uptime_seconds = 6;
  string namespace = 7;
  string session_id = 8;
}

enum IsolationLevel {
  ISOLATION_NONE = 0;
  ISOLATION_NAMESPACE = 1;
  ISOLATION_SESSION = 2;
}

enum ProcessState {
  STATE_STARTING = 0;
  STATE_RUNNING = 1;
  STATE_TERMINATING = 2;
  STATE_TERMINATED = 3;
  STATE_FAILED = 4;
}
Isolation Levels Explained
Isolation Level: NONE (Shared Process Pool)
All requests share the same process, regardless of namespace or session.
Use Case: Stateless patterns with no tenant-specific data (e.g., schema registry lookup)
Example:
Client A (namespace: tenant-a, session: user-1) ──┐
Client B (namespace: tenant-b, session: user-2) ──┼─→ shared:consumer (single process)
Client C (namespace: tenant-a, session: user-3) ──┘
Benefits:
- ✅ Lowest resource usage (one process serves all)
- ✅ Simplest management
Risks:
- ❌ No fault isolation (one bug affects all tenants)
- ❌ No resource isolation (noisy neighbor problem)
Isolation Level: NAMESPACE (Tenant Isolation)
Each namespace gets its own dedicated process. Multiple sessions within the same namespace share the process.
Use Case: Multi-tenant SaaS where tenants must be isolated (data security, billing, fault isolation)
Example:
Client A (namespace: tenant-a, session: user-1) ──┐
Client C (namespace: tenant-a, session: user-3) ──┼─→ ns:tenant-a:consumer
Client B (namespace: tenant-b, session: user-2) ────→ ns:tenant-b:consumer
Benefits:
- ✅ Fault isolation: tenant-a's crash doesn't affect tenant-b
- ✅ Resource quotas: limit CPU/memory per tenant
- ✅ Billing: track resource usage per tenant
Risks:
- ⚠️ Higher resource usage (one process per namespace)
- ⚠️ Cold start latency for new namespaces
Isolation Level: SESSION (Maximum Isolation)
Each session gets its own dedicated process. Maximum isolation guarantees.
Use Case: High-security environments, compliance requirements (PCI-DSS, HIPAA), debugging
Example:
Client A (namespace: tenant-a, session: user-1) ───→ session:user-1:consumer
Client B (namespace: tenant-b, session: user-2) ───→ session:user-2:consumer
Client C (namespace: tenant-a, session: user-3) ───→ session:user-3:consumer
Benefits:
- ✅ Maximum fault isolation: one session crash = one user affected
- ✅ Security: no cross-session data leakage possible
- ✅ Debugging: session-level logs and metrics
Risks:
- ❌ Highest resource usage (one process per session)
- ❌ Significant cold start latency
- ❌ Management overhead (thousands of processes possible)
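The three levels differ only in which request fields determine the process a request is routed to. A minimal sketch of that mapping follows; the `ProcessKey` function and the Go constant names are hypothetical, while the key formats match the IDs used in the examples above.

```go
package main

import (
	"errors"
	"fmt"
)

// IsolationLevel mirrors the three levels described above.
type IsolationLevel int

const (
	IsolationNone IsolationLevel = iota
	IsolationNamespace
	IsolationSession
)

// ProcessKey derives the process ID that a request maps to.
// Requests that produce the same key share one process.
func ProcessKey(level IsolationLevel, pattern, namespace, sessionID string) (string, error) {
	if pattern == "" {
		return "", errors.New("pattern name is required")
	}
	switch level {
	case IsolationNone:
		return "shared:" + pattern, nil
	case IsolationNamespace:
		if namespace == "" {
			return "", errors.New("namespace is required for namespace isolation")
		}
		return "ns:" + namespace + ":" + pattern, nil
	case IsolationSession:
		if sessionID == "" {
			return "", errors.New("session ID is required for session isolation")
		}
		return "session:" + sessionID + ":" + pattern, nil
	default:
		return "", fmt.Errorf("unknown isolation level %d", level)
	}
}

func main() {
	for _, level := range []IsolationLevel{IsolationNone, IsolationNamespace, IsolationSession} {
		key, err := ProcessKey(level, "consumer", "tenant-a", "user-1")
		fmt.Println(key, err)
	}
	// Output:
	// shared:consumer <nil>
	// ns:tenant-a:consumer <nil>
	// session:user-1:consumer <nil>
}
```

Rejecting an empty namespace or session ID up front avoids silently collapsing many tenants into a single key such as `ns::consumer`.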
Process Lifecycle with procmgr Integration
The launcher uses pkg/procmgr.ProcessManager for robust lifecycle management:
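// Pattern process syncer implementation
type patternProcessSyncer struct {
	launcher *PatternLauncher
}

func (s *patternProcessSyncer) SyncProcess(ctx context.Context, updateType procmgr.UpdateType, config interface{}) (terminal bool, err error) {
	processConfig, ok := config.(*ProcessConfig)
	if !ok {
		return false, fmt.Errorf("unexpected config type %T", config)
	}

	// Look up the manifest under the read lock that guards the patterns map
	s.launcher.patternsMu.RLock()
	manifest, found := s.launcher.patterns[processConfig.PatternName]
	s.launcher.patternsMu.RUnlock()
	if !found {
		return false, fmt.Errorf("unknown pattern %q", processConfig.PatternName)
	}

	// Build command. Note: with CommandContext the process is killed when ctx is cancelled.
	cmd := exec.CommandContext(ctx, manifest.Executable)
	cmd.Env = append(os.Environ(),
		fmt.Sprintf("PATTERN_NAME=%s", processConfig.PatternName),
		fmt.Sprintf("NAMESPACE=%s", processConfig.Namespace),
		fmt.Sprintf("SESSION_ID=%s", processConfig.SessionID),
		fmt.Sprintf("GRPC_PORT=%d", processConfig.GRPCPort),
	)

	// Start process
	if err := cmd.Start(); err != nil {
		return false, fmt.Errorf("start process: %w", err)
	}

	// Store process handle
	s.launcher.storeProcessHandle(processConfig.ProcessID, cmd.Process)

	// Wait for health check to pass; on failure, kill and reap the process
	if err := s.launcher.waitForHealthy(ctx, processConfig); err != nil {
		_ = cmd.Process.Kill()
		_ = cmd.Wait() // reap the child so it does not linger as a zombie
		s.launcher.removeProcessHandle(processConfig.ProcessID)
		return false, fmt.Errorf("health check failed: %w", err)
	}

	// Report cancellation if ctx ended during startup; otherwise the
	// process is running and has not reached a terminal state
	select {
	case <-ctx.Done():
		return false, ctx.Err()
	default:
		return false, nil
	}
}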
func (s *patternProcessSyncer) SyncTerminatingProcess(ctx context.Context, config interface{}, gracePeriodSecs *int64, statusFn procmgr.ProcessStatusFunc) error {
	processConfig, ok := config.(*ProcessConfig)
	if !ok {
		return fmt.Errorf("unexpected config type %T", config)
	}
	process := s.launcher.getProcessHandle(processConfig.ProcessID)
	if process == nil {
		return fmt.Errorf("no process handle for %s", processConfig.ProcessID)
	}

	// Send SIGTERM for graceful shutdown
	if err := process.Signal(syscall.SIGTERM); err != nil {
		return fmt.Errorf("send SIGTERM: %w", err)
	}

	// A nil grace period is treated as zero: force kill right after SIGTERM
	var timeout time.Duration
	if gracePeriodSecs != nil {
		timeout = time.Duration(*gracePeriodSecs) * time.Second
	}

	// Wait for graceful exit
	done := make(chan error, 1)
	go func() {
		_, err := process.Wait()
		done <- err
	}()

	select {
	case err := <-done:
		// Process exited gracefully
		return err
	case <-time.After(timeout):
		// Grace period expired: force kill, then wait until the exit is reaped
		_ = process.Kill() // error ignored: the process may have just exited on its own
		<-done
		return errors.New("forced kill after grace period")
	}
}

func (s *patternProcessSyncer) SyncTerminatedProcess(ctx context.Context, config interface{}) error {
	processConfig, ok := config.(*ProcessConfig)
	if !ok {
		return fmt.Errorf("unexpected config type %T", config)
	}

	// Cleanup resources
	s.launcher.removeProcessHandle(processConfig.ProcessID)
	return nil
}
Launch Request Flow
1. Proxy receives client request for pattern operation
   │
   ├─→ Check if pattern process already running (cache lookup)
   │   ├─ Yes: Use existing process
   │   └─ No: Send LaunchPattern gRPC request
   │
2. Launcher receives LaunchPattern request
   │
   ├─→ Determine ProcessID based on isolation level
   │   ├─ NONE: "shared:<pattern>"
   │   ├─ NAMESPACE: "ns:<namespace>:<pattern>"
   │   └─ SESSION: "session:<session_id>:<pattern>"
   │
   ├─→ IsolationManager.GetOrCreateProcess(isolationKey, processConfig)
   │   │
   │   ├─→ Check if process exists and healthy
   │   │   ├─ Yes: Return existing handle
   │   │   └─ No: Create new process
   │   │
   │   └─→ ProcessManager.UpdateProcess(CREATE)
   │       │
   │       ├─→ patternProcessSyncer.SyncProcess()
   │       │   ├─ exec.Command() - start pattern executable
   │       │   ├─ Wait for health check
   │       │   └─ Return success
   │       │
   │       └─→ Return ProcessHandle
   │
3. Launcher returns LaunchResponse
   │
   └─→ Proxy caches process address and forwards client request
Pattern Discovery
The launcher discovers available patterns by scanning directories:
patterns/
├── consumer/
│   ├── consumer          # Executable binary
│   ├── manifest.yaml     # Pattern metadata
│   └── README.md
├── producer/
│   ├── producer
│   ├── manifest.yaml
│   └── README.md
├── multicast_registry/
│   ├── multicast_registry
│   ├── manifest.yaml
│   └── README.md
└── claimcheck/
    ├── claimcheck
    ├── manifest.yaml
    └── README.md
Discovery algorithm:
- Scan patterns/ directory
- For each subdirectory, check for manifest.yaml
- Validate manifest schema
- Check executable exists and is runnable
- Load pattern into registry
Configuration
Launcher configuration (~/.prism/launcher-config.yaml):
launcher:
  # Port for gRPC API
  grpc_port: 8982

  # Pattern discovery
  patterns_dir: ./patterns

  # Default isolation level (can be overridden per pattern)
  default_isolation: namespace

  # Process manager settings
  process_manager:
    resync_interval: 30s
    backoff_period: 5s
    max_concurrent_starts: 10

  # Resource limits (applied to all pattern processes)
  resources:
    cpu_limit: 2.0
    memory_limit: 1Gi

  # Metrics and observability
  metrics:
    port: 9092
    path: /metrics

  health:
    port: 9093
    path: /health
Implementation Phases
Phase 1: Core Launcher (Week 1)
Deliverables:
- cmd/pattern-launcher skeleton with gRPC server
- Pattern manifest schema and validation
- Pattern discovery (scan filesystem for manifests)
- Integration with pkg/isolation.IsolationManager
- LaunchPattern API (no actual process launch yet, just mock)
Tests:
- Pattern discovery finds all valid manifests
- Invalid manifests rejected
- gRPC API responds to LaunchPattern requests
- IsolationManager creates correct ProcessIDs
Phase 2: Process Launch (Week 2)
Deliverables:
- patternProcessSyncer implementation
- exec.Command() integration for spawning processes
- Process handle tracking (PID, port, address)
- Health check polling (HTTP /health endpoint)
- LaunchPattern returns running process address
Tests:
- Launch single pattern process successfully
- Health check waits until process ready
- Failed process launch returns error
- Process address returned correctly
Phase 3: Isolation Levels (Week 3) ✅ COMPLETE
Deliverables:
- ✅ Namespace isolation: one process per namespace
- ✅ Session isolation: one process per session
- ✅ None isolation: shared process
- ✅ Concurrent launch requests handled correctly
- ✅ Process reuse for existing isolation keys
Tests (pkg/launcher/integration_test.go):
- ✅ TestIsolationLevels_Integration: All 3 isolation levels (NONE, NAMESPACE, SESSION)
  - Verifies process reuse for same isolation key
  - Verifies process separation for different keys
  - Validates PID tracking and address assignment
- ✅ TestConcurrentLaunches: 5 concurrent requests correctly reuse process
- ✅ TestProcessTermination: Graceful termination with status verification
- ✅ TestHealthCheck: Health endpoint monitoring and service health reporting
Results:
- Unit tests: 100% passing (run with -short flag, 0.2s)
- Integration tests: Created (require actual process launching)
- Test pattern binary: Built and ready (7.4MB, Go-based with HTTP health endpoint)
Phase 4: Termination and Cleanup (Week 4)
Deliverables:
- TerminatePattern API
- Graceful SIGTERM with timeout
- Force SIGKILL after grace period
- Process cleanup (remove from tracking)
- Orphan process detection and cleanup
Tests:
- Graceful shutdown completes within grace period
- Force kill after timeout
- Terminated processes removed from list
- Orphaned processes detected and terminated
Phase 5: Metrics and Observability (Week 5)
Deliverables:
- Prometheus metrics export
- Process lifecycle metrics (starts, stops, failures)
- Isolation level distribution metrics
- Resource usage per process
- Health endpoint with detailed status
Tests:
- Metrics exported correctly
- Counter increases on process start/stop
- Health endpoint returns all processes
- Resource metrics tracked accurately
Usage Examples
Example 1: Proxy Launches Consumer Pattern
// Illustrative Go client for the launcher API. The Prism proxy itself is
// written in Rust and would make the equivalent gRPC call through its own
// generated client.
func launchConsumerPattern(ctx context.Context, conn *grpc.ClientConn, namespace string) (string, error) {
	client := NewPatternLauncherClient(conn)
	resp, err := client.LaunchPattern(ctx, &LaunchRequest{
		PatternName: "consumer",
		Isolation:   IsolationLevel_ISOLATION_NAMESPACE,
		Namespace:   namespace,
		Config: map[string]string{
			"kafka_brokers":  "localhost:9092",
			"consumer_group": fmt.Sprintf("%s-consumer", namespace),
		},
	})
	if err != nil {
		return "", fmt.Errorf("launch consumer: %w", err)
	}

	// Cache the process address for future requests
	proxyCache.Set(namespace, "consumer", resp.Address)
	return resp.Address, nil
}
Example 2: Local Development Workflow
# Terminal 1: Start pattern launcher
cd cmd/pattern-launcher
go run . --config ~/.prism/launcher-config.yaml
# Terminal 2: Use prismctl to launch pattern
prismctl pattern launch consumer --namespace tenant-a --isolation namespace
# Terminal 3: Check running patterns
prismctl pattern list
# Output:
# PATTERN             PROCESS ID                 STATE    HEALTHY  UPTIME
# consumer            ns:tenant-a:consumer       RUNNING  true     5m30s
# producer            ns:tenant-a:producer       RUNNING  true     3m15s
# multicast_registry  shared:multicast_registry  RUNNING  true     10m45s
Example 3: Kubernetes Alternative (No Launcher Needed)
In Kubernetes, the launcher is not used. Instead, patterns are deployed as Deployments:
# patterns/consumer/k8s-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: consumer-pattern
spec:
  replicas: 3
  selector:
    matchLabels:
      app: consumer
  template:
    metadata:
      labels:
        app: consumer
    spec:
      containers:
        - name: consumer
          image: prism/consumer:1.0.0
          env:
            - name: ISOLATION_LEVEL
              value: "namespace"
            - name: GRPC_PORT
              value: "50051"
          resources:
            limits:
              cpu: 1.0
              memory: 512Mi
The Prism proxy discovers pattern processes via Kubernetes service discovery (DNS, endpoints API).
Metrics and Observability
Prometheus Metrics
# Pattern lifecycle
pattern_launcher_process_starts_total{pattern, namespace, isolation} counter
pattern_launcher_process_stops_total{pattern, namespace, isolation} counter
pattern_launcher_process_failures_total{pattern, namespace, isolation} counter
# Process state
pattern_launcher_processes_running{pattern, isolation} gauge
pattern_launcher_processes_terminating{pattern, isolation} gauge
# Launch latency
pattern_launcher_launch_duration_seconds{pattern, isolation} histogram
# Isolation distribution
pattern_launcher_isolation_level{level} gauge
# Resource usage (per process)
pattern_launcher_process_cpu_usage{process_id} gauge
pattern_launcher_process_memory_bytes{process_id} gauge
Health Check Response
{
  "status": "healthy",
  "total_processes": 15,
  "running_processes": 13,
  "terminating_processes": 2,
  "failed_processes": 0,
  "isolation_distribution": {
    "none": 2,
    "namespace": 10,
    "session": 3
  },
  "processes": [
    {
      "pattern_name": "consumer",
      "process_id": "ns:tenant-a:consumer",
      "state": "RUNNING",
      "healthy": true,
      "uptime_seconds": 3600,
      "namespace": "tenant-a",
      "address": "localhost:50051"
    }
  ]
}
Security Considerations
- Process Isolation: Use OS-level process isolation (cgroups, namespaces) to prevent cross-contamination
- Resource Limits: Enforce CPU/memory limits per process to prevent resource exhaustion
- Authentication: gRPC API requires mTLS or OIDC token authentication
- Authorization: Only authorized namespaces can launch patterns
- Audit Logging: All launch/terminate operations logged for security audit
- Secret Management: Pattern configs may contain secrets (DB passwords) - use secret providers
Performance Considerations
- Cold Start Latency: First request for a namespace incurs process spawn latency (~500ms-2s)
- Process Reuse: Subsequent requests to same namespace reuse existing process (< 10ms)
- Concurrent Launches: ProcessManager handles concurrent launch requests without race conditions
- Memory Overhead: Each process consumes memory (baseline ~50MB + pattern-specific usage)
- CPU Overhead: Process management goroutines negligible (< 1% CPU)
Optimization: Implement warm pool for common patterns (pre-launch consumer processes for popular namespaces).