
RFC-054: Netflix-Inspired Platform Enhancements for Prism

Status: Proposed | Author: Platform Team | Created: 2025-11-14 | Updated: 2025-11-14

Abstract

This RFC proposes platform-level enhancements to Prism inspired by Netflix's Data Gateway architecture and operational learnings. Netflix's decade-plus experience running a data abstraction layer at global scale provides valuable lessons for making Prism more resilient, cost-efficient, and developer-friendly.

Key Enhancements:

  1. Platform-Level Resilience - Circuit breakers, adaptive load shedding, backpressure at proxy layer
  2. Automated Data Lifecycle Management - TTL policies, automated cleanup, cost tracking per namespace
  3. Multi-Modal Access Patterns - Same data accessible through different pattern abstractions
  4. Capability-Aware Routing - Intelligent backend selection based on operation requirements
  5. Standardized Observability - Platform-wide metrics, tracing, and cost attribution

Netflix demonstrated that a well-designed platform layer can shield applications from database complexity while maintaining extreme reliability (99.99%+ uptime) at massive scale (petabytes of data, billions of operations/day).

Motivation

Netflix's Data Gateway Success

Netflix's Data Gateway platform provides several specialized abstractions:

  • Key-Value DAL: Simple CRUD operations with multi-backend support (Cassandra, EVCache, DynamoDB)
  • TimeSeries: Temporal event data with time-based queries (Cassandra, Elasticsearch)
  • Distributed Counter: Counting with configurable accuracy/latency trade-offs (EVCache, Cassandra)
  • Write-Ahead Log: Durable event logging with guaranteed ordering (already in RFC-051)

Key Characteristics:

  1. Abstraction Platform - Common infrastructure for multiple data access patterns
  2. Resilience First - Circuit breakers, load shedding, backpressure built into platform
  3. Data Lifecycle - TTL, cleanup, and cost management as first-class concerns
  4. Operational Excellence - Automated capacity planning, incident learning, best practice enforcement

Current Prism Architecture

Prism has similar pattern abstractions (RFC-014, RFC-046):

  • KeyValue Pattern - Key-value storage with backend slots
  • PubSub/Queue Patterns - Messaging with producer/consumer abstractions
  • Multicast Registry - Identity registration with selective multicast
  • Graph Pattern - Graph traversal and algorithms (RFC-053)
  • Session Store - Distributed session management

What's Missing:

  1. Platform-level resilience - Patterns implement their own circuit breaking
  2. Standardized data lifecycle - Each pattern handles TTL differently
  3. Cost visibility - No unified cost tracking across patterns
  4. Multi-modal access - Can't access same data through different patterns
  5. Capability negotiation - Limited backend selection intelligence

Goals

  1. Add Platform-Level Resilience - Circuit breakers, load shedding, backpressure in proxy
  2. Automate Data Lifecycle - TTL policies, cleanup automation, cost tracking
  3. Enable Multi-Modal Access - Same backend data through multiple pattern views
  4. Intelligent Backend Selection - Route operations to optimal backend based on capabilities
  5. Standardize Observability - Unified metrics, tracing, cost attribution

Non-Goals

  • Replacing existing pattern designs (build on top of RFC-014, RFC-046)
  • Changing protobuf APIs (additive enhancements only)
  • Requiring application code changes (platform improvements are transparent)

Platform-Level Resilience

Problem: Resilience Currently Per-Pattern

Current State: Each pattern implements its own resilience:

// patterns/keyvalue/pattern.go
func (kv *KeyValuePattern) Store(ctx context.Context, req *StoreRequest) (*StoreResponse, error) {
    // Pattern-specific circuit breaker
    if kv.circuitBreaker.IsOpen() {
        return nil, errors.New("circuit breaker open")
    }

    // Pattern-specific retry logic around the backend call
    if err := kv.backend.Set(ctx, req.Key, req.Value); err != nil {
        return nil, err
    }
    return &StoreResponse{}, nil
}

Problems:

  • Duplicate circuit breaker implementations across patterns
  • Inconsistent behavior (different timeout/retry configs)
  • No cross-pattern coordination (all patterns fail independently)
  • Difficult to configure (must set per-pattern)

Solution: Platform-Level Resilience Layer

Architecture:

┌─────────────────────────────────────────────────────────┐
│ Client Application │
└────────────────────┬────────────────────────────────────┘
│ gRPC request

┌─────────────────────────────────────────────────────────┐
│ Prism Proxy (Rust) │
│ ┌───────────────────────────────────────────────────┐ │
│ │ Platform Resilience Layer (NEW) │ │
│ │ │ │
│ │ 1. Circuit Breaker (per backend + global) │ │
│ │ 2. Adaptive Load Shedding (based on latency) │ │
│ │ 3. Backpressure (queue depth monitoring) │ │
│ │ 4. Request Hedging (parallel requests) │ │
│ │ 5. Timeout Management (operation-specific) │ │
│ └───────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────┐ │
│ │ Pattern Layer (KeyValue, PubSub, etc.) │ │
│ │ - Resilience handled by platform │ │
│ │ - Focus on business logic only │ │
│ └───────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘

Circuit Breaker Configuration

Per-Backend Circuit Breakers:

namespaces:
  - name: user-sessions
    pattern: keyvalue
    backend:
      type: redis

    # Platform-level resilience (NEW)
    resilience:
      circuit_breaker:
        enabled: true
        error_threshold: 0.5      # Open after 50% error rate
        min_requests: 100         # Minimum requests before opening
        timeout: 30s              # Stay open for 30s
        half_open_requests: 10    # Test with 10 requests when half-open

      adaptive_load_shedding:
        enabled: true
        target_latency_p99: 100ms # Target P99 latency
        shed_probability: 0.5     # Start shedding at 50% over target
        min_accept_rate: 0.1      # Always accept 10% of requests

      backpressure:
        enabled: true
        queue_depth_threshold: 1000 # Start rejecting if queue > 1000
        queue_latency_threshold: 5s # Reject if queue time > 5s

      timeouts:
        operation_timeout: 5s     # Per-operation timeout
        connection_timeout: 1s    # Backend connection timeout
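
Sketch (illustrative): the state machine implied by these settings might look roughly as follows. The CircuitBreaker type, field names, and transition details are assumptions for illustration, not the actual Prism implementation.

// Illustrative sketch only: a per-backend circuit breaker driven by the config above.
use std::time::{Duration, Instant};

#[derive(Clone, Copy)]
enum State { Closed, Open { since: Instant }, HalfOpen { probes_left: u32 } }

pub struct CircuitBreaker {
    state: State,
    error_threshold: f64,    // e.g. 0.5 (open at 50% error rate)
    min_requests: u64,       // e.g. 100
    open_timeout: Duration,  // e.g. 30s
    half_open_requests: u32, // e.g. 10
    requests: u64,
    errors: u64,
}

impl CircuitBreaker {
    pub fn new(error_threshold: f64, min_requests: u64, open_timeout: Duration, half_open_requests: u32) -> Self {
        Self { state: State::Closed, error_threshold, min_requests, open_timeout, half_open_requests, requests: 0, errors: 0 }
    }

    /// Called before each backend request.
    pub fn allow_request(&mut self) -> bool {
        match self.state {
            State::Closed => true,
            State::Open { since } if since.elapsed() >= self.open_timeout => {
                // Timeout elapsed: allow a limited number of probe requests.
                self.state = State::HalfOpen { probes_left: self.half_open_requests };
                true
            }
            State::Open { .. } => false,
            State::HalfOpen { probes_left } => probes_left > 0,
        }
    }

    /// Called with the outcome of each backend request.
    pub fn record_result(&mut self, ok: bool) {
        self.requests += 1;
        if !ok { self.errors += 1; }

        match self.state {
            State::Closed => {
                let error_rate = self.errors as f64 / self.requests as f64;
                if self.requests >= self.min_requests && error_rate >= self.error_threshold {
                    self.state = State::Open { since: Instant::now() };
                }
            }
            State::HalfOpen { probes_left } => {
                if !ok {
                    // A probe failed: re-open immediately.
                    self.state = State::Open { since: Instant::now() };
                } else if probes_left <= 1 {
                    // All probes succeeded: close and reset counters.
                    self.requests = 0;
                    self.errors = 0;
                    self.state = State::Closed;
                } else {
                    self.state = State::HalfOpen { probes_left: probes_left - 1 };
                }
            }
            State::Open { .. } => {}
        }
    }
}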

Global Circuit Breaker:

# Proxy-level configuration
proxy:
  global_resilience:
    circuit_breaker:
      enabled: true
      # Open if ANY backend has >80% error rate
      global_error_threshold: 0.8
      # Fail fast for all namespaces
      fail_fast_on_global_open: true

Adaptive Load Shedding (Netflix Pattern)

Problem: Traffic spikes saturate backend, causing cascading failures.

Netflix Solution: Shed load based on latency feedback loops.

Implementation:

// pkg/proxy/resilience/load_shedding.rs
pub struct AdaptiveLoadShedder {
    target_latency: Duration,
    current_latency: Arc<AtomicU64>,  // P99 latency in nanoseconds
    shed_probability: Arc<AtomicU64>, // 0-100%
}

impl AdaptiveLoadShedder {
    pub fn should_accept_request(&self) -> bool {
        let current_p99 = Duration::from_nanos(
            self.current_latency.load(Ordering::Relaxed)
        );

        if current_p99 <= self.target_latency {
            return true; // Fast path: latency good, accept all
        }

        // Latency over target: start shedding
        let overage_ratio = current_p99.as_millis() as f64
            / self.target_latency.as_millis() as f64;

        // Shed probability increases with overage:
        // 2× target latency = 50% shed, 3× = 66% shed, etc.
        // Cap at 90% so the configured min_accept_rate (10%) is honored.
        let shed_prob = (1.0 - (1.0 / overage_ratio)).min(0.9);

        // Random accept/reject based on shed probability
        rand::random::<f64>() > shed_prob
    }

    pub fn update_latency(&self, latency: Duration) {
        // Update P99 latency with an exponential moving average (alpha ≈ 0.2)
        let prev = self.current_latency.load(Ordering::Relaxed);
        let sample = latency.as_nanos() as u64;
        let ema = if prev == 0 { sample } else { (prev * 4 + sample) / 5 };
        self.current_latency.store(ema, Ordering::Relaxed);
    }
}

Client Experience:

Normal conditions (P99 < 100ms):
- All requests accepted ✅

Moderate load (P99 = 200ms, 2× target):
- 50% of requests shed ⚠️
- Error: 429 Too Many Requests (shed_reason: "latency_overload")

High load (P99 = 500ms, 5× target):
- 80% of requests shed 🚨
- Backend protected from saturation
- Error: 429 Too Many Requests (shed_reason: "latency_overload")

Backpressure Monitoring

Problem: Proxy queue builds up, requests timeout before execution.

Solution: Monitor queue depth and latency, reject early.

// pkg/proxy/resilience/backpressure.rs
pub struct BackpressureMonitor {
    queue_depth: Arc<AtomicUsize>,
    max_queue_depth: usize,
    queue_wait_times: Arc<Mutex<VecDeque<Duration>>>,
}

impl BackpressureMonitor {
    pub fn check_accept_request(&self) -> Result<(), Error> {
        // Check queue depth
        let depth = self.queue_depth.load(Ordering::Relaxed);
        if depth > self.max_queue_depth {
            return Err(Error::QueueFull {
                current_depth: depth,
                max_depth: self.max_queue_depth,
            });
        }

        // Check queue latency (median of last 100 requests)
        let wait_times = self.queue_wait_times.lock().unwrap();
        let median_wait = calculate_median(&wait_times);

        if median_wait > Duration::from_secs(5) {
            return Err(Error::QueueLatencyTooHigh {
                median_wait,
                threshold: Duration::from_secs(5),
            });
        }

        Ok(())
    }
}

Metrics:

# Queue depth (target: <1000)
prism_queue_depth{namespace="user-sessions"} 850

# Queue wait time (target: <1s)
prism_queue_wait_seconds{namespace="user-sessions", quantile="0.5"} 0.2
prism_queue_wait_seconds{namespace="user-sessions", quantile="0.99"} 3.5

# Rejection rate due to backpressure
rate(prism_requests_rejected_total{reason="queue_full"}[5m]) 120/sec

Automated Data Lifecycle Management

Problem: Manual Data Cleanup

Current State: Applications must implement their own cleanup:

# Application code (bad - manual cleanup)
client.keyvalue.store(key="session:123", value=data)

# Remember to clean up later... (often forgotten)
schedule_cleanup("session:123", after=3600) # 1 hour

Problems:

  • Developers forget to set TTL
  • Data accumulates indefinitely
  • Storage costs balloon
  • Performance degrades (scan operations slow)

Netflix's Approach: TTL as First-Class Concern

Key Insight: Data cleanup should be automatic and declarative, not manual.

Netflix Strategies:

  1. Default TTL policies - Every key has a TTL unless explicitly persistent
  2. Snapshot Janitors - Automated cleanup of old snapshots
  3. Cost attribution - Track storage costs per namespace
  4. Automated tiering - Move old data to cheaper storage

Solution: Declarative Data Lifecycle Policies

Configuration:

namespaces:
  - name: user-sessions
    pattern: keyvalue

    # Data lifecycle policies (NEW)
    lifecycle:
      default_ttl: 3600 # 1 hour default for all keys

      # Pattern-specific overrides
      ttl_policies:
        - key_pattern: "session:*"
          ttl: 3600   # 1 hour for sessions

        - key_pattern: "cache:*"
          ttl: 300    # 5 min for cache

        - key_pattern: "user:*:profile"
          ttl: 86400  # 1 day for profiles

        - key_pattern: "config:*"
          ttl: -1     # Persistent for config

      # Automated cleanup
      cleanup:
        enabled: true
        scan_interval: 3600 # Scan every hour
        batch_size: 1000    # Delete 1000 keys per batch

      # Cost tracking
      cost_tracking:
        enabled: true
        storage_cost_per_gb_month: 0.023 # $0.023/GB/month
        operation_cost_per_million: 0.40 # $0.40/million ops

      # Tiered storage (Netflix pattern)
      tiering:
        enabled: true
        hot_tier:
          backend: redis     # Hot: Redis
          max_age: 86400     # 1 day

        warm_tier:
          backend: postgres  # Warm: Postgres
          max_age: 604800    # 7 days

        cold_tier:
          backend: s3        # Cold: S3
          retention: 2592000 # 30 days, then delete

Client API (TTL is automatic):

# Application code (good - TTL automatic)
client.keyvalue.store(
    key="session:123",
    value=data
    # No TTL needed - uses lifecycle policy "session:*" -> 1 hour
)

# Explicit TTL override if needed
client.keyvalue.store(
    key="session:123",
    value=data,
    ttl=7200  # 2 hours (overrides policy)
)
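
Under the hood, the proxy can resolve the effective TTL by matching the key against the configured key_pattern entries in order, falling back to default_ttl. A minimal sketch, where the policy struct and the wildcard matcher are illustrative assumptions rather than existing Prism code:

// Hypothetical sketch: resolve the effective TTL for a key from lifecycle policies.
pub struct TtlPolicy {
    pub key_pattern: String, // e.g. "session:*" or "user:*:profile"
    pub ttl: i64,            // seconds; -1 means persistent
}

/// Minimal '*' wildcard matcher: literal segments must appear in order.
fn wildcard_match(pattern: &str, key: &str) -> bool {
    let parts: Vec<&str> = pattern.split('*').collect();
    let mut rest = key;
    for (i, part) in parts.iter().enumerate() {
        if part.is_empty() { continue; }
        match rest.find(part) {
            // The first segment must match at the start of the key.
            Some(pos) if i > 0 || pos == 0 => rest = &rest[pos + part.len()..],
            _ => return false,
        }
    }
    // If the pattern does not end with '*', the key must end with the last segment.
    pattern.ends_with('*') || parts.last().map_or(true, |last| key.ends_with(last))
}

pub fn resolve_ttl(key: &str, policies: &[TtlPolicy], default_ttl: i64, explicit: Option<i64>) -> i64 {
    // An explicit TTL on the request always wins over policy.
    if let Some(ttl) = explicit {
        return ttl;
    }
    policies.iter()
        .find(|p| wildcard_match(&p.key_pattern, key))
        .map(|p| p.ttl)
        .unwrap_or(default_ttl)
}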

Automated Tiering Example

Scenario: Session data accessed frequently for 1 day, occasionally for 7 days, rarely after.

Configuration:

lifecycle:
  tiering:
    enabled: true

    # Day 0-1: Hot tier (Redis)
    hot_tier:
      backend: redis
      max_age: 86400
      replication: 3

    # Day 1-7: Warm tier (Postgres)
    warm_tier:
      backend: postgres
      max_age: 604800

    # Day 7-30: Cold tier (S3)
    cold_tier:
      backend: s3
      retention: 2592000
      storage_class: GLACIER # Cheap archival

    # Day 30+: Delete
    delete_after: 2592000

Data Flow:

Day 0: Write → Redis (hot tier)
- P99 latency: 1ms
- Cost: $0.10/GB/month

Day 1: Background job moves to Postgres (warm tier)
- P99 latency: 10ms
- Cost: $0.05/GB/month

Day 7: Background job moves to S3 (cold tier)
- P99 latency: 100ms
- Cost: $0.004/GB/month (Glacier)

Day 30: Automated deletion
- Cost: $0
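
These migrations can be driven by a periodic background job that demotes entries once they exceed each tier's max_age and purges data past the retention window. A rough sketch under assumed interfaces (the Store trait, Tier struct, and keys_older_than helper are illustrative, not existing Prism APIs):

// Hypothetical sketch of the tier-migration job implied by the data flow above.
use std::time::Duration;

pub trait Store {
    fn keys_older_than(&self, age: Duration) -> Vec<String>;
    fn read(&self, key: &str) -> Option<Vec<u8>>;
    fn write(&self, key: &str, value: &[u8]);
    fn delete(&self, key: &str);
}

pub struct Tier {
    pub store: Box<dyn Store>,
    pub max_age: Duration, // demote entries older than this
}

/// One pass over the tiers (ordered hot -> warm -> cold):
/// demote entries that exceed each tier's max_age, then purge expired data.
pub fn run_tiering_pass(tiers: &[Tier], delete_after: Duration) {
    for window in tiers.windows(2) {
        let (hotter, colder) = (&window[0], &window[1]);
        for key in hotter.store.keys_older_than(hotter.max_age) {
            if let Some(value) = hotter.store.read(&key) {
                colder.store.write(&key, &value); // copy down first ...
                hotter.store.delete(&key);        // ... then remove from the hotter tier
            }
        }
    }
    // Final tier: delete anything past the overall retention window (delete_after).
    if let Some(last) = tiers.last() {
        for key in last.store.keys_older_than(delete_after) {
            last.store.delete(&key);
        }
    }
}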

Cost Tracking and Attribution

Metrics:

# Storage cost per namespace
prism_storage_cost_usd_per_month{namespace="user-sessions"} 1250.00

# Operation cost per namespace
prism_operation_cost_usd_per_month{namespace="user-sessions"} 350.00

# Total cost breakdown
prism_total_cost_usd_per_month{
namespace="user-sessions",
cost_type="storage"
} 1250.00

prism_total_cost_usd_per_month{
namespace="user-sessions",
cost_type="operations"
} 350.00

# Storage by tier
prism_storage_bytes{namespace="user-sessions", tier="hot"} 500GB
prism_storage_bytes{namespace="user-sessions", tier="warm"} 2TB
prism_storage_bytes{namespace="user-sessions", tier="cold"} 10TB
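
The attributed figures follow directly from the configured rates and observed usage. A small illustrative calculation using the rates from the lifecycle config above (type and function names are hypothetical):

// Hypothetical sketch: monthly cost attribution for one namespace.
pub struct UsageSample {
    pub storage_gb: f64, // average GB stored this month
    pub operations: f64, // operations this month
}

/// Returns (storage_cost, operation_cost) in USD per month.
pub fn monthly_cost(usage: &UsageSample, storage_per_gb: f64, ops_per_million: f64) -> (f64, f64) {
    let storage_cost = usage.storage_gb * storage_per_gb;
    let operation_cost = usage.operations / 1_000_000.0 * ops_per_million;
    (storage_cost, operation_cost)
}

// Example with the configured rate of $0.40/million ops:
// 500M reads + 375M writes = 875M ops -> 875 * $0.40 = $350/month,
// matching the operation line items in the dashboard below.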

Cost Dashboard:

Namespace: user-sessions
├── Storage Costs: $1,250/month
│ ├── Hot (Redis): $500/month (500GB)
│ ├── Warm (Postgres): $250/month (2TB)
│ └── Cold (S3 Glacier): $500/month (10TB)
├── Operation Costs: $350/month
│ ├── Reads: $200/month (500M/month)
│ └── Writes: $150/month (375M/month)
└── Total: $1,600/month

Recommendations:
⚠️ Consider increasing hot_tier TTL to reduce tiering overhead
✅ 96% of reads served from hot tier (caching is effective)
⚠️ Cold tier has 85% unused data → reduce retention to 15 days

Multi-Modal Access Patterns

Problem: Single Access Pattern Per Backend

Current State: Each namespace has one pattern.

# Can ONLY access as KeyValue
namespaces:
  - name: user-profiles
    pattern: keyvalue
    backend:
      type: postgres

Limitations:

  • Same data can't be queried differently
  • Must duplicate data to support multiple access patterns
  • No way to run analytics on operational data

Netflix's Approach: Multiple Abstractions Over Same Data

Example: TimeSeries data accessible as:

  1. TimeSeries Pattern - Time-range queries for recent events
  2. Counter Pattern - Aggregated counts with different accuracy modes
  3. Analytics - Export to Parquet on S3 for Spark jobs

Solution: Multi-Modal Namespace Configuration

Configuration:

namespaces:
  - name: user-events

    # Primary pattern (write path)
    primary_pattern: timeseries
    backend:
      type: postgres
      table: user_events

    # Secondary patterns (read views) - NEW
    secondary_patterns:
      # View 1: Counter aggregation
      - pattern: counter
        mode: eventually_consistent_global
        aggregation:
          group_by: [user_id, event_type]
          window: 3600 # 1 hour windows

      # View 2: KeyValue lookup
      - pattern: keyvalue
        key_template: "user:{user_id}:events"
        materialized_view: true # Precompute for fast lookup

      # View 3: Analytics export
      - pattern: analytics_export
        destination: s3
        format: parquet
        partition_by: [date, event_type]
        schedule: "0 2 * * *" # Daily at 2 AM

Client Usage:

# Write via primary pattern (TimeSeries)
client.timeseries.write(
    namespace="user-events",
    timestamp=now(),
    event_type="play_video",
    user_id="user:123",
    data={"video_id": "v456", "position": 120}
)

# Read via Counter pattern (aggregated)
count = client.counter.get(
    namespace="user-events",
    key="user:123",
    event_type="play_video",
    time_range="last_24h"
)
print(f"Videos played: {count}")  # 15

# Read via KeyValue pattern (latest events)
events = client.keyvalue.retrieve(
    namespace="user-events",
    key="user:user:123:events"
)
print(f"Last 10 events: {events[:10]}")

# Analytics via S3 export
# spark.read.parquet("s3://prism-analytics/user-events/date=2025-11-14/")

Benefits:

  • Write once, read many ways
  • No data duplication
  • Pattern-specific optimizations
  • Cost-effective (single storage, multiple views)

Multi-Modal Architecture

┌────────────────────────────────────────────────────────┐
│ Client Applications │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │TimeSeries│ │ Counter │ │ KeyValue │ │
│ │ Client │ │ Client │ │ Client │ │
│ └──────────┘ └──────────┘ └──────────┘ │
└───────┬────────────┬──────────────┬──────────────────┘
│ │ │
│ write │ read │ read
▼ ▼ ▼
┌────────────────────────────────────────────────────────┐
│ Prism Proxy (Rust) │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Multi-Modal Router (NEW) │ │
│ │ - Routes to primary pattern for writes │ │
│ │ - Routes to secondary patterns for reads │ │
│ │ - Maintains consistency guarantees │ │
│ └──────────────────────────────────────────────────┘ │
└────────────────────┬───────────────────────────────────┘


┌───────────────────────┐
│ PostgreSQL Backend │
│ ┌─────────────────┐ │
│ │ user_events │ │ ← Single table
│ │ (primary) │ │
│ └─────────────────┘ │
│ ┌─────────────────┐ │
│ │ Materialized │ │ ← Counter view
│ │ View: counts │ │
│ └─────────────────┘ │
│ ┌─────────────────┐ │
│ │ Materialized │ │ ← KeyValue view
│ │ View: latest │ │
│ └─────────────────┘ │
└───────────────────────┘
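
A simplified sketch of the router's dispatch rule: writes always go through the primary pattern so there is a single source of truth, while reads are served by whichever secondary view the client addressed. Trait and type names below are assumptions for illustration:

// Hypothetical sketch of write/read dispatch for a multi-modal namespace.
use std::collections::HashMap;

pub trait PatternView {
    fn read(&self, request: &[u8]) -> Vec<u8>;
}

pub trait PrimaryPattern {
    fn write(&self, request: &[u8]) -> Result<(), String>;
}

pub struct MultiModalNamespace {
    primary: Box<dyn PrimaryPattern>,
    views: HashMap<&'static str, Box<dyn PatternView>>, // "counter", "keyvalue", ...
}

impl MultiModalNamespace {
    /// Writes always go through the primary pattern.
    pub fn write(&self, request: &[u8]) -> Result<(), String> {
        self.primary.write(request)
        // Secondary views (materialized views, exports) are refreshed asynchronously.
    }

    /// Reads are served by the addressed secondary view.
    pub fn read(&self, view: &str, request: &[u8]) -> Result<Vec<u8>, String> {
        self.views.get(view)
            .map(|v| v.read(request))
            .ok_or_else(|| format!("namespace has no '{view}' view"))
    }
}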

Capability-Aware Backend Selection

Problem: Static Backend Assignment

Current State: Backend chosen at namespace creation.

namespaces:
  - name: user-sessions
    pattern: keyvalue
    backend:
      type: redis # Fixed choice

Limitations:

  • Can't optimize per-operation (read vs write)
  • Can't leverage multiple backends' strengths
  • Can't adapt to changing workload

Netflix's Approach: Intelligent Backend Selection

Example: EVCache for hot reads, Cassandra for durability.

Solution: Capability-Aware Routing

Configuration:

namespaces:
  - name: user-sessions
    pattern: keyvalue

    # Multi-backend configuration (NEW)
    backends:
      # Hot read cache
      - name: cache
        type: redis
        capabilities: [fast_read, ttl]
        priority: 1 # Try first for reads

      # Durable storage
      - name: primary
        type: postgres
        capabilities: [durable_write, transactions, scan]
        priority: 2 # Fallback for all ops

    # Write-through to both
    write_strategy: write_through

    # Routing rules (NEW)
    routing:
      # Reads: Try cache first, fallback to primary
      read_operations:
        - try: cache
          on_miss: primary

      # Writes: Write to both (durability + cache)
      write_operations:
        - targets: [primary, cache]
          consistency: write_through

      # Scans: Only primary supports this
      scan_operations:
        - try: primary

Operation Routing:

# Read operation
value = client.keyvalue.retrieve(key="session:123")
# 1. Try Redis (cache) → Hit (1ms) ✅
# 2. If miss, query Postgres (primary) → (10ms)
# 3. Populate Redis for next read

# Write operation
client.keyvalue.store(key="session:123", value=data)
# 1. Write to Postgres (primary, durable) → (10ms)
# 2. Write to Redis (cache) → (1ms)
# 3. Acknowledge after both succeed

# Scan operation
keys = client.keyvalue.scan(prefix="session:")
# 1. Route to Postgres (only backend with scan) → (50ms)
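
These flows amount to a cache-aside read path and a write-through write path inside the proxy; a minimal sketch, assuming a simple KvBackend trait (an illustration, not the plugin API):

// Hypothetical sketch of the routed read/write flows described above.
pub trait KvBackend {
    fn get(&self, key: &str) -> Option<Vec<u8>>;
    fn set(&self, key: &str, value: &[u8]) -> Result<(), String>;
}

pub struct RoutedKeyValue<C: KvBackend, P: KvBackend> {
    cache: C,   // e.g. Redis: fast_read, ttl
    primary: P, // e.g. Postgres: durable_write, scan
}

impl<C: KvBackend, P: KvBackend> RoutedKeyValue<C, P> {
    /// Read: try the cache, fall back to primary, then repopulate the cache.
    pub fn retrieve(&self, key: &str) -> Option<Vec<u8>> {
        if let Some(value) = self.cache.get(key) {
            return Some(value);                  // cache hit (~1ms)
        }
        let value = self.primary.get(key)?;      // cache miss, go to primary (~10ms)
        let _ = self.cache.set(key, &value);     // best-effort repopulate for next read
        Some(value)
    }

    /// Write-through: acknowledge only after both backends succeed.
    pub fn store(&self, key: &str, value: &[u8]) -> Result<(), String> {
        self.primary.set(key, value)?;           // durable write first
        self.cache.set(key, value)               // then refresh the cache
    }
}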

Backend Capability Matrix

Backend  | Capabilities                                     | Best For
---------|--------------------------------------------------|---------------------------------
Redis    | fast_read, fast_write, ttl, pubsub               | Hot data, caching, real-time
Postgres | durable_write, transactions, scan, complex_query | Durable data, ACID, analytics
S3       | cheap_storage, large_objects                     | Archival, backups, bulk data
Kafka    | ordered_stream, replay, high_throughput          | Events, WAL, streaming
Neptune  | graph_traversal, complex_relationships           | Connected data, recommendations

Routing Intelligence:

// pkg/proxy/routing/capability_router.rs
pub struct CapabilityRouter {
    backends: Vec<Backend>,
}

impl CapabilityRouter {
    pub fn select_backend(&self, operation: Operation) -> Option<&Backend> {
        let required_caps = operation.required_capabilities();

        // Find backends that satisfy all required capabilities
        let capable = self.backends.iter()
            .filter(|b| b.has_capabilities(&required_caps))
            .collect::<Vec<_>>();

        if capable.is_empty() {
            return None; // No backend can handle this
        }

        // Select best backend based on:
        // 1. Priority (admin-configured)
        // 2. Current load (circuit breaker state)
        // 3. Latency characteristics
        capable.iter()
            .max_by_key(|b| self.score_backend(b, &operation))
            .copied()
    }

    fn score_backend(&self, backend: &Backend, op: &Operation) -> i32 {
        // Lower priority number = preferred (priority 1 is tried first),
        // so start from the negated priority. Signed math avoids underflow.
        let mut score = -(backend.priority as i32) * 100;

        // Penalty for high latency
        if backend.p99_latency > Duration::from_millis(100) {
            score -= 50;
        }

        // Penalty for circuit breaker issues
        if backend.circuit_breaker.is_open() {
            score -= 200;
        }

        // Bonus for native capability support
        if backend.has_native_support(op) {
            score += 30;
        }

        score
    }
}

Standardized Observability

Problem: Inconsistent Metrics Across Patterns

Current State: Each pattern exports different metrics.

# KeyValue pattern metrics
prism_keyvalue_operations_total
prism_keyvalue_latency_seconds

# PubSub pattern metrics (different naming)
prism_pubsub_messages_total
prism_pubsub_publish_duration_seconds

# Inconsistent labels, naming conventions

Netflix's Approach: Platform-Wide Observability

Key Insight: Standardize metrics, tracing, and logging across all abstractions.

Solution: Standardized Observability Schema

Metrics Schema:

# Standard metrics for ALL patterns
observability:
  metrics:
    # Request metrics (standard across all patterns)
    - name: prism_requests_total
      type: counter
      labels: [namespace, pattern, operation, status]

    - name: prism_request_duration_seconds
      type: histogram
      labels: [namespace, pattern, operation]
      buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0]

    # Backend metrics (standard for all backends)
    - name: prism_backend_requests_total
      type: counter
      labels: [namespace, backend_type, operation, status]

    - name: prism_backend_duration_seconds
      type: histogram
      labels: [namespace, backend_type, operation]

    # Cost metrics (NEW)
    - name: prism_storage_bytes
      type: gauge
      labels: [namespace, tier, backend_type]

    - name: prism_storage_cost_usd_per_month
      type: gauge
      labels: [namespace, tier]

    - name: prism_operation_cost_usd_per_month
      type: gauge
      labels: [namespace, operation_type]

    # Resilience metrics (NEW)
    - name: prism_circuit_breaker_state
      type: gauge
      labels: [namespace, backend_type]
      values: {open: 0, half_open: 1, closed: 2}

    - name: prism_requests_shed_total
      type: counter
      labels: [namespace, reason]

    - name: prism_queue_depth
      type: gauge
      labels: [namespace, queue_type]
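
Because the schema is uniform, request metrics can be recorded at a single instrumentation point in the proxy rather than inside each pattern. A sketch, with MetricsSink standing in for whatever metrics library the proxy actually uses (the trait and wrapper are assumptions for illustration):

// Hypothetical sketch: one instrumentation point emitting the standard metric names.
use std::time::Instant;

pub trait MetricsSink {
    fn incr_counter(&self, name: &str, labels: &[(&str, &str)]);
    fn observe_histogram(&self, name: &str, labels: &[(&str, &str)], value: f64);
}

/// Wrap any pattern operation so prism_requests_total and
/// prism_request_duration_seconds are recorded uniformly.
pub fn instrumented<T, E>(
    sink: &dyn MetricsSink,
    namespace: &str,
    pattern: &str,
    operation: &str,
    f: impl FnOnce() -> Result<T, E>,
) -> Result<T, E> {
    let start = Instant::now();
    let result = f();
    let status = if result.is_ok() { "ok" } else { "error" };

    let labels = [("namespace", namespace), ("pattern", pattern),
                  ("operation", operation), ("status", status)];
    sink.incr_counter("prism_requests_total", &labels);
    sink.observe_histogram(
        "prism_request_duration_seconds",
        &labels[..3], // duration histogram omits the status label, per the schema
        start.elapsed().as_secs_f64(),
    );
    result
}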

Distributed Tracing (Standard):

Trace: Store session:123
├─ Span: ProxyService.Store [15ms]
│ ├─ Span: ResilienceLayer.check [0.1ms]
│ │ ├─ Tag: circuit_breaker_state=closed
│ │ ├─ Tag: shed_probability=0.0
│ │ └─ Tag: queue_depth=150
│ ├─ Span: LifecycleManager.apply_ttl [0.2ms]
│ │ ├─ Tag: ttl_policy=session:*
│ │ └─ Tag: computed_ttl=3600
│ ├─ Span: CapabilityRouter.select_backend [0.1ms]
│ │ ├─ Tag: backends_evaluated=2
│ │ └─ Tag: selected_backend=redis
│ ├─ Span: KeyValuePattern.store [10ms]
│ │ ├─ Span: RedisBackend.SET [5ms]
│ │ │ ├─ Tag: backend_type=redis
│ │ │ ├─ Tag: operation=SET
│ │ │ └─ Tag: bytes_written=1024
│ │ └─ Span: PostgresBackend.INSERT [5ms]
│ │ ├─ Tag: backend_type=postgres
│ │ ├─ Tag: operation=INSERT
│ │ └─ Tag: bytes_written=1024
│ └─ Span: CostTracking.record [0.1ms]
│ ├─ Tag: operation_cost_usd=0.0000004
│ └─ Tag: storage_cost_usd_per_gb=0.023
└─ Total: 15ms, Cost: $0.0000004
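
If the proxy uses the tracing crate, the standard tags could be attached roughly as follows; span names and fields mirror the trace above, but the exact integration is an assumption:

// Hypothetical sketch of attaching the standard span tags with the `tracing` crate.
use tracing::info_span;

fn store_with_tracing(namespace: &str, key: &str) {
    let span = info_span!(
        "ProxyService.Store",
        namespace,               // standard label on every span
        pattern = "keyvalue",
        operation = "store",
    );
    let _guard = span.enter();

    {
        let _resilience = info_span!(
            "ResilienceLayer.check",
            circuit_breaker_state = "closed",
            shed_probability = 0.0,
            queue_depth = 150_i64,
        ).entered();
        // ... circuit breaker / load shedding checks ...
    }

    {
        let _backend = info_span!(
            "RedisBackend.SET",
            backend_type = "redis",
            operation = "SET",
            key,
        ).entered();
        // ... backend call ...
    }
}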

Log Structure (Standard):

{
  "timestamp": "2025-11-14T10:30:00.123Z",
  "level": "info",
  "message": "KeyValue store operation succeeded",
  "namespace": "user-sessions",
  "pattern": "keyvalue",
  "operation": "store",
  "key": "session:123",
  "backend_type": "redis",
  "duration_ms": 1.2,
  "bytes_written": 1024,
  "ttl_seconds": 3600,
  "cost_usd": 0.0000004,
  "trace_id": "abc123",
  "span_id": "def456",
  "circuit_breaker_state": "closed",
  "queue_depth": 150
}

Implementation Roadmap

Phase 1: Platform Resilience (4 weeks)

Week 1-2: Circuit Breaker Layer

  • Implement per-backend circuit breakers in Rust proxy
  • Add configuration schema for circuit breaker policies
  • Integration tests with backend failures
  • Metrics: prism_circuit_breaker_state, prism_circuit_breaker_trips_total

Week 3-4: Load Shedding + Backpressure

  • Implement adaptive load shedding based on latency
  • Add queue depth monitoring and backpressure
  • Load testing to validate shed thresholds
  • Metrics: prism_requests_shed_total, prism_queue_depth

Deliverables:

  • Circuit breaker middleware in Rust proxy
  • Load shedding configuration per namespace
  • Grafana dashboard for resilience metrics

Phase 2: Data Lifecycle Management (4 weeks)

Week 1-2: TTL Policies

  • Add lifecycle configuration schema
  • Implement default TTL enforcement in patterns
  • Pattern-specific TTL override logic
  • Metrics: prism_keys_expired_total, prism_ttl_violations_total

Week 3-4: Automated Tiering

  • Background job for tiering (hot → warm → cold)
  • S3 integration for cold storage
  • Cost tracking per tier
  • Metrics: prism_storage_bytes{tier}, prism_tier_migrations_total

Deliverables:

  • Lifecycle policy configuration
  • Automated tiering job
  • Cost tracking dashboard

Phase 3: Multi-Modal Access (6 weeks)

Week 1-2: Multi-Modal Router

  • Design multi-modal namespace configuration
  • Implement pattern routing logic
  • Add secondary pattern support to proxy

Week 3-4: Materialized Views

  • Implement materialized view generation (Postgres)
  • Background refresh for views
  • Consistency guarantees

Week 5-6: Analytics Export

  • S3 Parquet export job
  • Scheduled export (daily/hourly)
  • Partition management

Deliverables:

  • Multi-modal namespace configuration
  • Materialized view support
  • Parquet export to S3

Phase 4: Capability-Aware Routing (4 weeks)

Week 1-2: Capability System

  • Define backend capability taxonomy
  • Implement capability detection for backends
  • Scoring function for backend selection

Week 3-4: Dynamic Routing

  • Implement operation-specific routing
  • Add routing rules configuration
  • Load testing for multi-backend namespaces

Deliverables:

  • Capability-aware router
  • Multi-backend namespace support
  • Routing rules configuration

Phase 5: Standardized Observability (3 weeks)

Week 1: Metrics Standardization

  • Migrate all patterns to standard metrics schema
  • Add cost metrics to patterns
  • Update Prometheus recording rules

Week 2: Tracing Enhancement

  • Add standard span tags across patterns
  • Implement cost tracking in spans
  • Integration with OpenTelemetry

Week 3: Dashboards + Alerts

  • Unified Grafana dashboards
  • Cost tracking dashboard
  • Standard alert rules for resilience

Deliverables:

  • Standard metrics across all patterns
  • Unified observability dashboards
  • Alert runbooks

Configuration Examples

Example 1: Production KeyValue with Full Resilience

namespaces:
  - name: user-sessions-prod
    pattern: keyvalue

    # Backends
    backends:
      - name: cache
        type: redis
        address: redis-cluster.prod:6379
        priority: 1

      - name: primary
        type: postgres
        address: postgres.prod:5432
        database: sessions
        priority: 2

    # Routing
    routing:
      read_operations:
        - try: cache
          on_miss: primary
      write_operations:
        - targets: [primary, cache]
          consistency: write_through

    # Resilience (NEW)
    resilience:
      circuit_breaker:
        enabled: true
        error_threshold: 0.5
        min_requests: 100
        timeout: 30s

      adaptive_load_shedding:
        enabled: true
        target_latency_p99: 100ms
        shed_probability: 0.5

      backpressure:
        enabled: true
        queue_depth_threshold: 1000

    # Lifecycle (NEW)
    lifecycle:
      default_ttl: 3600
      ttl_policies:
        - key_pattern: "session:*"
          ttl: 3600
        - key_pattern: "user:*:profile"
          ttl: 86400

      cleanup:
        enabled: true
        scan_interval: 3600

      cost_tracking:
        enabled: true
        storage_cost_per_gb_month: 0.023

      tiering:
        enabled: true
        hot_tier:
          backend: cache
          max_age: 86400
        warm_tier:
          backend: primary
          retention: 604800

Example 2: Multi-Modal TimeSeries + Counter

namespaces:
  - name: user-events-multi

    # Primary pattern (writes)
    primary_pattern: timeseries
    backend:
      type: postgres
      table: user_events

    # Secondary patterns (reads) - NEW
    secondary_patterns:
      # Counter aggregation view
      - pattern: counter
        mode: eventually_consistent_global
        aggregation:
          group_by: [user_id, event_type]
          window: 3600
          refresh_interval: 300

      # KeyValue lookup view
      - pattern: keyvalue
        key_template: "user:{user_id}:events"
        materialized_view: true
        refresh: continuous

      # Analytics export
      - pattern: analytics_export
        destination: s3
        bucket: prism-analytics
        format: parquet
        partition_by: [date, event_type]
        schedule: "0 2 * * *"

    # Lifecycle
    lifecycle:
      default_ttl: 2592000 # 30 days

      tiering:
        enabled: true
        hot_tier:
          backend: postgres
          max_age: 86400     # 1 day
        cold_tier:
          backend: s3
          retention: 2592000 # 30 days
          storage_class: GLACIER

Comparison to Netflix

Feature             | Netflix Data Gateway                  | Prism (Current)          | Prism (After RFC-054)
--------------------|---------------------------------------|--------------------------|----------------------
Platform Resilience | Circuit breakers, load shedding       | Per-pattern resilience   | ✅ Platform-level
Data Lifecycle      | TTL, Snapshot Janitors                | Manual cleanup           | ✅ Automated policies
Multi-Modal Access  | KV, TimeSeries, Counter on same data  | Single pattern/namespace | ✅ Secondary patterns
Cost Tracking       | Per-namespace attribution             | No tracking              | ✅ Built-in
Backend Selection   | Static configuration                  | Static configuration     | ✅ Capability-aware
Observability       | Standardized metrics                  | Pattern-specific         | ✅ Standardized

Success Criteria

  1. Resilience: 99.99% uptime across all namespaces (Netflix-level SLO)
  2. Cost Reduction: 30% storage cost reduction via automated tiering
  3. Developer Experience: Zero manual TTL management required
  4. Performance: P99 latency <100ms with load shedding
  5. Observability: Unified dashboards for all patterns

Open Questions

  1. Lifecycle Defaults: Should all namespaces have default TTL, or opt-in?
  2. Cost Attribution: Track at namespace or per-key granularity?
  3. Multi-Modal Consistency: Eventual or strong consistency between pattern views?
  4. Backend Failover: Automatic or manual failover when circuit breaker opens?
  5. Tiering Performance: Acceptable latency increase for warm/cold tier reads?

References

Netflix Documentation

External References

Revision History

  • 2025-11-14: Initial draft inspired by Netflix Data Gateway architecture