
RFC-054: Netflix-Inspired Platform Enhancements for Prism

Status: Proposed | Author: Platform Team | Created: 2025-11-14 | Updated: 2025-11-14

Abstract

This RFC proposes platform-level enhancements to Prism inspired by Netflix's Data Gateway architecture and operational learnings. Netflix's decade-plus experience running a data abstraction layer at global scale provides valuable lessons for making Prism more resilient, cost-efficient, and developer-friendly.

Key Enhancements:

  1. Platform-Level Resilience - Circuit breakers, adaptive load shedding, backpressure at proxy layer
  2. Automated Data Lifecycle Management - TTL policies, automated cleanup, cost tracking per namespace
  3. Multi-Modal Access Patterns - Same data accessible through different pattern abstractions
  4. Capability-Aware Routing - Intelligent backend selection based on operation requirements
  5. Standardized Observability - Platform-wide metrics, tracing, and cost attribution

Netflix demonstrated that a well-designed platform layer can shield applications from database complexity while maintaining extreme reliability (99.99%+ uptime) at massive scale (petabytes of data, billions of operations/day).

Motivation

Netflix's Data Gateway Success

Netflix's Data Gateway platform provides several specialized abstractions:

  • Key-Value DAL: Simple CRUD operations with multi-backend support (Cassandra, EVCache, DynamoDB)
  • TimeSeries: Temporal event data with time-based queries (Cassandra, Elasticsearch)
  • Distributed Counter: Counting with configurable accuracy/latency trade-offs (EVCache, Cassandra)
  • Write-Ahead Log: Durable event logging with guaranteed ordering (already in RFC-051)

Key Characteristics:

  1. Abstraction Platform - Common infrastructure for multiple data access patterns
  2. Resilience First - Circuit breakers, load shedding, backpressure built into platform
  3. Data Lifecycle - TTL, cleanup, and cost management as first-class concerns
  4. Operational Excellence - Automated capacity planning, incident learning, best practice enforcement

Current Prism Architecture

Prism has similar pattern abstractions (RFC-014, RFC-046):

  • KeyValue Pattern - Key-value storage with backend slots
  • PubSub/Queue Patterns - Messaging with producer/consumer abstractions
  • Multicast Registry - Identity registration with selective multicast
  • Graph Pattern - Graph traversal and algorithms (RFC-053)
  • Session Store - Distributed session management

What's Missing:

  1. Platform-level resilience - Patterns implement their own circuit breaking
  2. Standardized data lifecycle - Each pattern handles TTL differently
  3. Cost visibility - No unified cost tracking across patterns
  4. Multi-modal access - Can't access same data through different patterns
  5. Capability negotiation - Limited backend selection intelligence

Goals

  1. Add Platform-Level Resilience - Circuit breakers, load shedding, backpressure in proxy
  2. Automate Data Lifecycle - TTL policies, cleanup automation, cost tracking
  3. Enable Multi-Modal Access - Same backend data through multiple pattern views
  4. Intelligent Backend Selection - Route operations to optimal backend based on capabilities
  5. Standardize Observability - Unified metrics, tracing, cost attribution

Non-Goals

  • Replacing existing pattern designs (build on top of RFC-014, RFC-046)
  • Changing protobuf APIs (additive enhancements only)
  • Requiring application code changes (platform improvements are transparent)

Platform-Level Resilience

Problem: Resilience Currently Per-Pattern

Current State: Each pattern implements its own resilience:

// patterns/keyvalue/pattern.go
func (kv *KeyValuePattern) Store(ctx context.Context, req *StoreRequest) (*StoreResponse, error) {
    // Pattern-specific circuit breaker
    if kv.circuitBreaker.IsOpen() {
        return nil, errors.New("circuit breaker open")
    }

    // Pattern-specific retry logic around the backend call
    if err := kv.backend.Set(ctx, req.Key, req.Value); err != nil {
        return nil, err
    }
    return &StoreResponse{}, nil
}

Problems:

  • Duplicate circuit breaker implementations across patterns
  • Inconsistent behavior (different timeout/retry configs)
  • No cross-pattern coordination (all patterns fail independently)
  • Difficult to configure (must set per-pattern)

Solution: Platform-Level Resilience Layer

Architecture:

┌─────────────────────────────────────────────────────────┐
│ Client Application │
└────────────────────┬────────────────────────────────────┘
│ gRPC request

┌─────────────────────────────────────────────────────────┐
│ Prism Proxy (Rust) │
│ ┌───────────────────────────────────────────────────┐ │
│ │ Platform Resilience Layer (NEW) │ │
│ │ │ │
│ │ 1. Circuit Breaker (per backend + global) │ │
│ │ 2. Adaptive Load Shedding (based on latency) │ │
│ │ 3. Backpressure (queue depth monitoring) │ │
│ │ 4. Request Hedging (parallel requests) │ │
│ │ 5. Timeout Management (operation-specific) │ │
│ └───────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────┐ │
│ │ Pattern Layer (KeyValue, PubSub, etc.) │ │
│ │ - Resilience handled by platform │ │
│ │ - Focus on business logic only │ │
│ └───────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘

Circuit Breaker Configuration

Per-Backend Circuit Breakers:

namespaces:
  - name: user-sessions
    pattern: keyvalue
    backend:
      type: redis

    # Platform-level resilience (NEW)
    resilience:
      circuit_breaker:
        enabled: true
        error_threshold: 0.5      # Open after 50% error rate
        min_requests: 100         # Minimum requests before opening
        timeout: 30s              # Stay open for 30s
        half_open_requests: 10    # Test with 10 requests when half-open

      adaptive_load_shedding:
        enabled: true
        target_latency_p99: 100ms # Target P99 latency
        shed_probability: 0.5     # Start shedding at 50% over target
        min_accept_rate: 0.1      # Always accept 10% of requests

      backpressure:
        enabled: true
        queue_depth_threshold: 1000 # Start rejecting if queue > 1000
        queue_latency_threshold: 5s # Reject if queue time > 5s

      timeouts:
        operation_timeout: 5s     # Per-operation timeout
        connection_timeout: 1s    # Backend connection timeout
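
Sketch (illustrative): the state machine implied by these settings might look roughly as follows. The CircuitBreaker type, field names, and transition details are assumptions for illustration, not the actual Prism implementation.

// Illustrative sketch only: a per-backend circuit breaker driven by the config above.
use std::time::{Duration, Instant};

#[derive(Clone, Copy)]
enum State { Closed, Open { since: Instant }, HalfOpen { probes_left: u32 } }

pub struct CircuitBreaker {
    state: State,
    error_threshold: f64,    // e.g. 0.5 (open at 50% error rate)
    min_requests: u64,       // e.g. 100
    open_timeout: Duration,  // e.g. 30s
    half_open_requests: u32, // e.g. 10
    requests: u64,
    errors: u64,
}

impl CircuitBreaker {
    pub fn new(error_threshold: f64, min_requests: u64, open_timeout: Duration, half_open_requests: u32) -> Self {
        Self { state: State::Closed, error_threshold, min_requests, open_timeout, half_open_requests, requests: 0, errors: 0 }
    }

    /// Called before each backend request.
    pub fn allow_request(&mut self) -> bool {
        match self.state {
            State::Closed => true,
            State::Open { since } if since.elapsed() >= self.open_timeout => {
                // Timeout elapsed: allow a limited number of probe requests.
                self.state = State::HalfOpen { probes_left: self.half_open_requests };
                true
            }
            State::Open { .. } => false,
            State::HalfOpen { probes_left } => probes_left > 0,
        }
    }

    /// Called with the outcome of each backend request.
    pub fn record_result(&mut self, ok: bool) {
        self.requests += 1;
        if !ok { self.errors += 1; }

        match self.state {
            State::Closed => {
                let error_rate = self.errors as f64 / self.requests as f64;
                if self.requests >= self.min_requests && error_rate >= self.error_threshold {
                    self.state = State::Open { since: Instant::now() };
                }
            }
            State::HalfOpen { probes_left } => {
                if !ok {
                    // A probe failed: re-open immediately.
                    self.state = State::Open { since: Instant::now() };
                } else if probes_left <= 1 {
                    // All probes succeeded: close and reset counters.
                    self.requests = 0;
                    self.errors = 0;
                    self.state = State::Closed;
                } else {
                    self.state = State::HalfOpen { probes_left: probes_left - 1 };
                }
            }
            State::Open { .. } => {}
        }
    }
}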

Global Circuit Breaker:

# Proxy-level configuration
proxy:
  global_resilience:
    circuit_breaker:
      enabled: true
      # Open if ANY backend has >80% error rate
      global_error_threshold: 0.8
      # Fail fast for all namespaces
      fail_fast_on_global_open: true

Adaptive Load Shedding (Netflix Pattern)

Problem: Traffic spikes saturate backend, causing cascading failures.

Netflix Solution: Shed load based on latency feedback loops.

Implementation:

// pkg/proxy/resilience/load_shedding.rs
pub struct AdaptiveLoadShedder {
    target_latency: Duration,
    current_latency: Arc<AtomicU64>,  // P99 latency in nanoseconds
    shed_probability: Arc<AtomicU64>, // 0-100%
}

impl AdaptiveLoadShedder {
    pub fn should_accept_request(&self) -> bool {
        let current_p99 = Duration::from_nanos(
            self.current_latency.load(Ordering::Relaxed)
        );

        if current_p99 <= self.target_latency {
            return true; // Fast path: latency good, accept all
        }

        // Latency over target: start shedding
        let overage_ratio = current_p99.as_millis() as f64
            / self.target_latency.as_millis() as f64;

        // Shed probability increases with overage:
        // 2× target latency = 50% shed, 3× = 66% shed, etc.
        // Cap at 90% so the configured min_accept_rate (10%) is honored.
        let shed_prob = (1.0 - (1.0 / overage_ratio)).min(0.9);

        // Random accept/reject based on shed probability
        rand::random::<f64>() > shed_prob
    }

    pub fn update_latency(&self, latency: Duration) {
        // Update P99 latency with an exponential moving average (alpha ≈ 0.2)
        let prev = self.current_latency.load(Ordering::Relaxed);
        let sample = latency.as_nanos() as u64;
        let ema = if prev == 0 { sample } else { (prev * 4 + sample) / 5 };
        self.current_latency.store(ema, Ordering::Relaxed);
    }
}

Client Experience:

Normal conditions (P99 < 100ms):
- All requests accepted ✅

Moderate load (P99 = 200ms, 2× target):
- 50% of requests shed ⚠️
- Error: 429 Too Many Requests (shed_reason: "latency_overload")

High load (P99 = 500ms, 5× target):
- 80% of requests shed 🚨
- Backend protected from saturation
- Error: 429 Too Many Requests (shed_reason: "latency_overload")

Backpressure Monitoring

Problem: Proxy queue builds up, requests timeout before execution.

Solution: Monitor queue depth and latency, reject early.

// pkg/proxy/resilience/backpressure.rs
pub struct BackpressureMonitor {
    queue_depth: Arc<AtomicUsize>,
    max_queue_depth: usize,
    queue_wait_times: Arc<Mutex<VecDeque<Duration>>>,
}

impl BackpressureMonitor {
    pub fn check_accept_request(&self) -> Result<(), Error> {
        // Check queue depth
        let depth = self.queue_depth.load(Ordering::Relaxed);
        if depth > self.max_queue_depth {
            return Err(Error::QueueFull {
                current_depth: depth,
                max_depth: self.max_queue_depth,
            });
        }

        // Check queue latency (median of last 100 requests)
        let wait_times = self.queue_wait_times.lock().unwrap();
        let median_wait = calculate_median(&wait_times);

        if median_wait > Duration::from_secs(5) {
            return Err(Error::QueueLatencyTooHigh {
                median_wait,
                threshold: Duration::from_secs(5),
            });
        }

        Ok(())
    }
}

Metrics:

# Queue depth (target: <1000)
prism_queue_depth{namespace="user-sessions"} 850

# Queue wait time (target: <1s)
prism_queue_wait_seconds{namespace="user-sessions", quantile="0.5"} 0.2
prism_queue_wait_seconds{namespace="user-sessions", quantile="0.99"} 3.5

# Rejection rate due to backpressure
rate(prism_requests_rejected_total{reason="queue_full"}[5m]) 120/sec

Automated Data Lifecycle Management

Problem: Manual Data Cleanup

Current State: Applications must implement their own cleanup:

# Application code (bad - manual cleanup)
client.keyvalue.store(key="session:123", value=data)

# Remember to clean up later... (often forgotten)
schedule_cleanup("session:123", after=3600) # 1 hour

Problems:

  • Developers forget to set TTL
  • Data accumulates indefinitely
  • Storage costs balloon
  • Performance degrades (scan operations slow)

Netflix's Approach: TTL as First-Class Concern

Key Insight: Data cleanup should be automatic and declarative, not manual.

Netflix Strategies:

  1. Default TTL policies - Every key has a TTL unless explicitly persistent
  2. Snapshot Janitors - Automated cleanup of old snapshots
  3. Cost attribution - Track storage costs per namespace
  4. Automated tiering - Move old data to cheaper storage

Solution: Declarative Data Lifecycle Policies

Configuration:

namespaces:
  - name: user-sessions
    pattern: keyvalue

    # Data lifecycle policies (NEW)
    lifecycle:
      default_ttl: 3600 # 1 hour default for all keys

      # Pattern-specific overrides
      ttl_policies:
        - key_pattern: "session:*"
          ttl: 3600   # 1 hour for sessions

        - key_pattern: "cache:*"
          ttl: 300    # 5 min for cache

        - key_pattern: "user:*:profile"
          ttl: 86400  # 1 day for profiles

        - key_pattern: "config:*"
          ttl: -1     # Persistent for config

      # Automated cleanup
      cleanup:
        enabled: true
        scan_interval: 3600 # Scan every hour
        batch_size: 1000    # Delete 1000 keys per batch

      # Cost tracking
      cost_tracking:
        enabled: true
        storage_cost_per_gb_month: 0.023 # $0.023/GB/month
        operation_cost_per_million: 0.40 # $0.40/million ops

      # Tiered storage (Netflix pattern)
      tiering:
        enabled: true
        hot_tier:
          backend: redis     # Hot: Redis
          max_age: 86400     # 1 day

        warm_tier:
          backend: postgres  # Warm: Postgres
          max_age: 604800    # 7 days

        cold_tier:
          backend: s3        # Cold: S3
          retention: 2592000 # 30 days, then delete

Client API (TTL is automatic):

# Application code (good - TTL automatic)
client.keyvalue.store(
    key="session:123",
    value=data
    # No TTL needed - uses lifecycle policy "session:*" -> 1 hour
)

# Explicit TTL override if needed
client.keyvalue.store(
    key="session:123",
    value=data,
    ttl=7200  # 2 hours (overrides policy)
)
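
Under the hood, the proxy can resolve the effective TTL by matching the key against the configured key_pattern entries in order, falling back to default_ttl. A minimal sketch, where the policy struct and the wildcard matcher are illustrative assumptions rather than existing Prism code:

// Hypothetical sketch: resolve the effective TTL for a key from lifecycle policies.
pub struct TtlPolicy {
    pub key_pattern: String, // e.g. "session:*" or "user:*:profile"
    pub ttl: i64,            // seconds; -1 means persistent
}

/// Minimal '*' wildcard matcher: literal segments must appear in order.
fn wildcard_match(pattern: &str, key: &str) -> bool {
    let parts: Vec<&str> = pattern.split('*').collect();
    let mut rest = key;
    for (i, part) in parts.iter().enumerate() {
        if part.is_empty() { continue; }
        match rest.find(part) {
            // The first segment must match at the start of the key.
            Some(pos) if i > 0 || pos == 0 => rest = &rest[pos + part.len()..],
            _ => return false,
        }
    }
    // If the pattern does not end with '*', the key must end with the last segment.
    pattern.ends_with('*') || parts.last().map_or(true, |last| key.ends_with(last))
}

pub fn resolve_ttl(key: &str, policies: &[TtlPolicy], default_ttl: i64, explicit: Option<i64>) -> i64 {
    // An explicit TTL on the request always wins over policy.
    if let Some(ttl) = explicit {
        return ttl;
    }
    policies.iter()
        .find(|p| wildcard_match(&p.key_pattern, key))
        .map(|p| p.ttl)
        .unwrap_or(default_ttl)
}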

Automated Tiering Example

Scenario: Session data accessed frequently for 1 day, occasionally for 7 days, rarely after.

Configuration:

lifecycle:
  tiering:
    enabled: true

    # Day 0-1: Hot tier (Redis)
    hot_tier:
      backend: redis
      max_age: 86400
      replication: 3

    # Day 1-7: Warm tier (Postgres)
    warm_tier:
      backend: postgres
      max_age: 604800

    # Day 7-30: Cold tier (S3)
    cold_tier:
      backend: s3
      retention: 2592000
      storage_class: GLACIER # Cheap archival

    # Day 30+: Delete
    delete_after: 2592000

Data Flow:

Day 0: Write → Redis (hot tier)
- P99 latency: 1ms
- Cost: $0.10/GB/month

Day 1: Background job moves to Postgres (warm tier)
- P99 latency: 10ms
- Cost: $0.05/GB/month

Day 7: Background job moves to S3 (cold tier)
- P99 latency: 100ms
- Cost: $0.004/GB/month (Glacier)

Day 30: Automated deletion
- Cost: $0
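
These migrations can be driven by a periodic background job that demotes entries once they exceed each tier's max_age and purges data past the retention window. A rough sketch under assumed interfaces (the Store trait, Tier struct, and keys_older_than helper are illustrative, not existing Prism APIs):

// Hypothetical sketch of the tier-migration job implied by the data flow above.
use std::time::Duration;

pub trait Store {
    fn keys_older_than(&self, age: Duration) -> Vec<String>;
    fn read(&self, key: &str) -> Option<Vec<u8>>;
    fn write(&self, key: &str, value: &[u8]);
    fn delete(&self, key: &str);
}

pub struct Tier {
    pub store: Box<dyn Store>,
    pub max_age: Duration, // demote entries older than this
}

/// One pass over the tiers (ordered hot -> warm -> cold):
/// demote entries that exceed each tier's max_age, then purge expired data.
pub fn run_tiering_pass(tiers: &[Tier], delete_after: Duration) {
    for window in tiers.windows(2) {
        let (hotter, colder) = (&window[0], &window[1]);
        for key in hotter.store.keys_older_than(hotter.max_age) {
            if let Some(value) = hotter.store.read(&key) {
                colder.store.write(&key, &value); // copy down first ...
                hotter.store.delete(&key);        // ... then remove from the hotter tier
            }
        }
    }
    // Final tier: delete anything past the overall retention window (delete_after).
    if let Some(last) = tiers.last() {
        for key in last.store.keys_older_than(delete_after) {
            last.store.delete(&key);
        }
    }
}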

Cost Tracking and Attribution

Metrics:

# Storage cost per namespace
prism_storage_cost_usd_per_month{namespace="user-sessions"} 1250.00

# Operation cost per namespace
prism_operation_cost_usd_per_month{namespace="user-sessions"} 350.00

# Total cost breakdown
prism_total_cost_usd_per_month{
namespace="user-sessions",
cost_type="storage"
} 1250.00

prism_total_cost_usd_per_month{
namespace="user-sessions",
cost_type="operations"
} 350.00

# Storage by tier
prism_storage_bytes{namespace="user-sessions", tier="hot"} 500GB
prism_storage_bytes{namespace="user-sessions", tier="warm"} 2TB
prism_storage_bytes{namespace="user-sessions", tier="cold"} 10TB
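
The attributed figures follow directly from the configured rates and observed usage. A small illustrative calculation using the rates from the lifecycle config above (type and function names are hypothetical):

// Hypothetical sketch: monthly cost attribution for one namespace.
pub struct UsageSample {
    pub storage_gb: f64, // average GB stored this month
    pub operations: f64, // operations this month
}

/// Returns (storage_cost, operation_cost) in USD per month.
pub fn monthly_cost(usage: &UsageSample, storage_per_gb: f64, ops_per_million: f64) -> (f64, f64) {
    let storage_cost = usage.storage_gb * storage_per_gb;
    let operation_cost = usage.operations / 1_000_000.0 * ops_per_million;
    (storage_cost, operation_cost)
}

// Example with the configured rate of $0.40/million ops:
// 500M reads + 375M writes = 875M ops -> 875 * $0.40 = $350/month,
// matching the operation line items in the dashboard below.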

Cost Dashboard:

Namespace: user-sessions
├── Storage Costs: $1,250/month
│ ├── Hot (Redis): $500/month (500GB)
│ ├── Warm (Postgres): $250/month (2TB)
│ └── Cold (S3 Glacier): $500/month (10TB)
├── Operation Costs: $350/month
│ ├── Reads: $200/month (500M/month)
│ └── Writes: $150/month (375M/month)
└── Total: $1,600/month

Recommendations:
⚠️ Consider increasing hot_tier TTL to reduce tiering overhead
✅ 96% of reads served from hot tier (caching is effective)
⚠️ Cold tier has 85% unused data → reduce retention to 15 days

Multi-Modal Access Patterns

Problem: Single Access Pattern Per Backend

Current State: Each namespace has one pattern.

# Can ONLY access as KeyValue
namespaces:
  - name: user-profiles
    pattern: keyvalue
    backend:
      type: postgres

Limitations:

  • Same data can't be queried differently
  • Must duplicate data to support multiple access patterns
  • No way to run analytics on operational data

Netflix's Approach: Multiple Abstractions Over Same Data

Example: TimeSeries data accessible as:

  1. TimeSeries Pattern - Time-range queries for recent events
  2. Counter Pattern - Aggregated counts with different accuracy modes
  3. Analytics - Export to Parquet on S3 for Spark jobs

Solution: Multi-Modal Namespace Configuration

Configuration:

namespaces:
  - name: user-events

    # Primary pattern (write path)
    primary_pattern: timeseries
    backend:
      type: postgres
      table: user_events

    # Secondary patterns (read views) - NEW
    secondary_patterns:
      # View 1: Counter aggregation
      - pattern: counter
        mode: eventually_consistent_global
        aggregation:
          group_by: [user_id, event_type]
          window: 3600 # 1 hour windows

      # View 2: KeyValue lookup
      - pattern: keyvalue
        key_template: "user:{user_id}:events"
        materialized_view: true # Precompute for fast lookup

      # View 3: Analytics export
      - pattern: analytics_export
        destination: s3
        format: parquet
        partition_by: [date, event_type]
        schedule: "0 2 * * *" # Daily at 2 AM

Client Usage:

# Write via primary pattern (TimeSeries)
client.timeseries.write(
    namespace="user-events",
    timestamp=now(),
    event_type="play_video",
    user_id="user:123",
    data={"video_id": "v456", "position": 120}
)

# Read via Counter pattern (aggregated)
count = client.counter.get(
    namespace="user-events",
    key="user:123",
    event_type="play_video",
    time_range="last_24h"
)
print(f"Videos played: {count}")  # 15

# Read via KeyValue pattern (latest events)
events = client.keyvalue.retrieve(
    namespace="user-events",
    key="user:user:123:events"
)
print(f"Last 10 events: {events[:10]}")

# Analytics via S3 export
# spark.read.parquet("s3://prism-analytics/user-events/date=2025-11-14/")

Benefits:

  • Write once, read many ways
  • No data duplication
  • Pattern-specific optimizations
  • Cost-effective (single storage, multiple views)

Multi-Modal Architecture

┌────────────────────────────────────────────────────────┐
│ Client Applications │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │TimeSeries│ │ Counter │ │ KeyValue │ │
│ │ Client │ │ Client │ │ Client │ │
│ └──────────┘ └──────────┘ └──────────┘ │
└───────┬────────────┬──────────────┬──────────────────┘
│ │ │
│ write │ read │ read
▼ ▼ ▼
┌────────────────────────────────────────────────────────┐
│ Prism Proxy (Rust) │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Multi-Modal Router (NEW) │ │
│ │ - Routes to primary pattern for writes │ │
│ │ - Routes to secondary patterns for reads │ │
│ │ - Maintains consistency guarantees │ │
│ └──────────────────────────────────────────────────┘ │
└────────────────────┬───────────────────────────────────┘


┌───────────────────────┐
│ PostgreSQL Backend │
│ ┌─────────────────┐ │
│ │ user_events │ │ ← Single table
│ │ (primary) │ │
│ └─────────────────┘ │
│ ┌─────────────────┐ │
│ │ Materialized │ │ ← Counter view
│ │ View: counts │ │
│ └─────────────────┘ │
│ ┌─────────────────┐ │
│ │ Materialized │ │ ← KeyValue view
│ │ View: latest │ │
│ └─────────────────┘ │
└───────────────────────┘
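
A simplified sketch of the router's dispatch rule: writes always go through the primary pattern so there is a single source of truth, while reads are served by whichever secondary view the client addressed. Trait and type names below are assumptions for illustration:

// Hypothetical sketch of write/read dispatch for a multi-modal namespace.
use std::collections::HashMap;

pub trait PatternView {
    fn read(&self, request: &[u8]) -> Vec<u8>;
}

pub trait PrimaryPattern {
    fn write(&self, request: &[u8]) -> Result<(), String>;
}

pub struct MultiModalNamespace {
    primary: Box<dyn PrimaryPattern>,
    views: HashMap<&'static str, Box<dyn PatternView>>, // "counter", "keyvalue", ...
}

impl MultiModalNamespace {
    /// Writes always go through the primary pattern.
    pub fn write(&self, request: &[u8]) -> Result<(), String> {
        self.primary.write(request)
        // Secondary views (materialized views, exports) are refreshed asynchronously.
    }

    /// Reads are served by the addressed secondary view.
    pub fn read(&self, view: &str, request: &[u8]) -> Result<Vec<u8>, String> {
        self.views.get(view)
            .map(|v| v.read(request))
            .ok_or_else(|| format!("namespace has no '{view}' view"))
    }
}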

Capability-Aware Backend Selection

Problem: Static Backend Assignment

Current State: Backend chosen at namespace creation.

namespaces:
  - name: user-sessions
    pattern: keyvalue
    backend:
      type: redis # Fixed choice

Limitations:

  • Can't optimize per-operation (read vs write)
  • Can't leverage multiple backends' strengths
  • Can't adapt to changing workload

Netflix's Approach: Intelligent Backend Selection

Example: EVCache for hot reads, Cassandra for durability.

Solution: Capability-Aware Routing

Configuration:

namespaces:
  - name: user-sessions
    pattern: keyvalue

    # Multi-backend configuration (NEW)
    backends:
      # Hot read cache
      - name: cache
        type: redis
        capabilities: [fast_read, ttl]
        priority: 1 # Try first for reads

      # Durable storage
      - name: primary
        type: postgres
        capabilities: [durable_write, transactions, scan]
        priority: 2 # Fallback for all ops

    # Write-through to both
    write_strategy: write_through

    # Routing rules (NEW)
    routing:
      # Reads: Try cache first, fallback to primary
      read_operations:
        - try: cache
          on_miss: primary

      # Writes: Write to both (durability + cache)
      write_operations:
        - targets: [primary, cache]
          consistency: write_through

      # Scans: Only primary supports this
      scan_operations:
        - try: primary

Operation Routing:

# Read operation
value = client.keyvalue.retrieve(key="session:123")
# 1. Try Redis (cache) → Hit (1ms) ✅
# 2. If miss, query Postgres (primary) → (10ms)
# 3. Populate Redis for next read

# Write operation
client.keyvalue.store(key="session:123", value=data)
# 1. Write to Postgres (primary, durable) → (10ms)
# 2. Write to Redis (cache) → (1ms)
# 3. Acknowledge after both succeed

# Scan operation
keys = client.keyvalue.scan(prefix="session:")
# 1. Route to Postgres (only backend with scan) → (50ms)
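
These flows amount to a cache-aside read path and a write-through write path inside the proxy; a minimal sketch, assuming a simple KvBackend trait (an illustration, not the plugin API):

// Hypothetical sketch of the routed read/write flows described above.
pub trait KvBackend {
    fn get(&self, key: &str) -> Option<Vec<u8>>;
    fn set(&self, key: &str, value: &[u8]) -> Result<(), String>;
}

pub struct RoutedKeyValue<C: KvBackend, P: KvBackend> {
    cache: C,   // e.g. Redis: fast_read, ttl
    primary: P, // e.g. Postgres: durable_write, scan
}

impl<C: KvBackend, P: KvBackend> RoutedKeyValue<C, P> {
    /// Read: try the cache, fall back to primary, then repopulate the cache.
    pub fn retrieve(&self, key: &str) -> Option<Vec<u8>> {
        if let Some(value) = self.cache.get(key) {
            return Some(value);                  // cache hit (~1ms)
        }
        let value = self.primary.get(key)?;      // cache miss, go to primary (~10ms)
        let _ = self.cache.set(key, &value);     // best-effort repopulate for next read
        Some(value)
    }

    /// Write-through: acknowledge only after both backends succeed.
    pub fn store(&self, key: &str, value: &[u8]) -> Result<(), String> {
        self.primary.set(key, value)?;           // durable write first
        self.cache.set(key, value)               // then refresh the cache
    }
}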

Backend Capability Matrix

Backend  | Capabilities                                     | Best For
---------|--------------------------------------------------|---------------------------------
Redis    | fast_read, fast_write, ttl, pubsub               | Hot data, caching, real-time
Postgres | durable_write, transactions, scan, complex_query | Durable data, ACID, analytics
S3       | cheap_storage, large_objects                     | Archival, backups, bulk data
Kafka    | ordered_stream, replay, high_throughput          | Events, WAL, streaming
Neptune  | graph_traversal, complex_relationships           | Connected data, recommendations

Routing Intelligence:

// pkg/proxy/routing/capability_router.rs
pub struct CapabilityRouter {
    backends: Vec<Backend>,
}

impl CapabilityRouter {
    pub fn select_backend(&self, operation: Operation) -> Option<&Backend> {
        let required_caps = operation.required_capabilities();

        // Find backends that satisfy all required capabilities
        let capable = self.backends.iter()
            .filter(|b| b.has_capabilities(&required_caps))
            .collect::<Vec<_>>();

        if capable.is_empty() {
            return None; // No backend can handle this
        }

        // Select best backend based on:
        // 1. Priority (admin-configured)
        // 2. Current load (circuit breaker state)
        // 3. Latency characteristics
        capable.iter()
            .max_by_key(|b| self.score_backend(b, &operation))
            .copied()
    }

    fn score_backend(&self, backend: &Backend, op: &Operation) -> i32 {
        // Lower priority number = preferred (priority 1 is tried first),
        // so start from the negated priority. Signed math avoids underflow.
        let mut score = -(backend.priority as i32) * 100;

        // Penalty for high latency
        if backend.p99_latency > Duration::from_millis(100) {
            score -= 50;
        }

        // Penalty for circuit breaker issues
        if backend.circuit_breaker.is_open() {
            score -= 200;
        }

        // Bonus for native capability support
        if backend.has_native_support(op) {
            score += 30;
        }

        score
    }
}

Standardized Observability

Problem: Inconsistent Metrics Across Patterns

Current State: Each pattern exports different metrics.

# KeyValue pattern metrics
prism_keyvalue_operations_total
prism_keyvalue_latency_seconds

# PubSub pattern metrics (different naming)
prism_pubsub_messages_total
prism_pubsub_publish_duration_seconds

# Inconsistent labels, naming conventions

Netflix's Approach: Platform-Wide Observability

Key Insight: Standardize metrics, tracing, and logging across all abstractions.

Solution: Standardized Observability Schema

Metrics Schema:

# Standard metrics for ALL patterns
observability:
  metrics:
    # Request metrics (standard across all patterns)
    - name: prism_requests_total
      type: counter
      labels: [namespace, pattern, operation, status]

    - name: prism_request_duration_seconds
      type: histogram
      labels: [namespace, pattern, operation]
      buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0]

    # Backend metrics (standard for all backends)
    - name: prism_backend_requests_total
      type: counter
      labels: [namespace, backend_type, operation, status]

    - name: prism_backend_duration_seconds
      type: histogram
      labels: [namespace, backend_type, operation]

    # Cost metrics (NEW)
    - name: prism_storage_bytes
      type: gauge
      labels: [namespace, tier, backend_type]

    - name: prism_storage_cost_usd_per_month
      type: gauge
      labels: [namespace, tier]

    - name: prism_operation_cost_usd_per_month
      type: gauge
      labels: [namespace, operation_type]

    # Resilience metrics (NEW)
    - name: prism_circuit_breaker_state
      type: gauge
      labels: [namespace, backend_type]
      values: {open: 0, half_open: 1, closed: 2}

    - name: prism_requests_shed_total
      type: counter
      labels: [namespace, reason]

    - name: prism_queue_depth
      type: gauge
      labels: [namespace, queue_type]
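
Because the schema is uniform, request metrics can be recorded at a single instrumentation point in the proxy rather than inside each pattern. A sketch, with MetricsSink standing in for whatever metrics library the proxy actually uses (the trait and wrapper are assumptions for illustration):

// Hypothetical sketch: one instrumentation point emitting the standard metric names.
use std::time::Instant;

pub trait MetricsSink {
    fn incr_counter(&self, name: &str, labels: &[(&str, &str)]);
    fn observe_histogram(&self, name: &str, labels: &[(&str, &str)], value: f64);
}

/// Wrap any pattern operation so prism_requests_total and
/// prism_request_duration_seconds are recorded uniformly.
pub fn instrumented<T, E>(
    sink: &dyn MetricsSink,
    namespace: &str,
    pattern: &str,
    operation: &str,
    f: impl FnOnce() -> Result<T, E>,
) -> Result<T, E> {
    let start = Instant::now();
    let result = f();
    let status = if result.is_ok() { "ok" } else { "error" };

    let labels = [("namespace", namespace), ("pattern", pattern),
                  ("operation", operation), ("status", status)];
    sink.incr_counter("prism_requests_total", &labels);
    sink.observe_histogram(
        "prism_request_duration_seconds",
        &labels[..3], // duration histogram omits the status label, per the schema
        start.elapsed().as_secs_f64(),
    );
    result
}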

Distributed Tracing (Standard):

Trace: Store session:123
├─ Span: ProxyService.Store [15ms]
│ ├─ Span: ResilienceLayer.check [0.1ms]
│ │ ├─ Tag: circuit_breaker_state=closed
│ │ ├─ Tag: shed_probability=0.0
│ │ └─ Tag: queue_depth=150
│ ├─ Span: LifecycleManager.apply_ttl [0.2ms]
│ │ ├─ Tag: ttl_policy=session:*
│ │ └─ Tag: computed_ttl=3600
│ ├─ Span: CapabilityRouter.select_backend [0.1ms]
│ │ ├─ Tag: backends_evaluated=2
│ │ └─ Tag: selected_backend=redis
│ ├─ Span: KeyValuePattern.store [10ms]
│ │ ├─ Span: RedisBackend.SET [5ms]
│ │ │ ├─ Tag: backend_type=redis
│ │ │ ├─ Tag: operation=SET
│ │ │ └─ Tag: bytes_written=1024
│ │ └─ Span: PostgresBackend.INSERT [5ms]
│ │ ├─ Tag: backend_type=postgres
│ │ ├─ Tag: operation=INSERT
│ │ └─ Tag: bytes_written=1024
│ └─ Span: CostTracking.record [0.1ms]
│ ├─ Tag: operation_cost_usd=0.0000004
│ └─ Tag: storage_cost_usd_per_gb=0.023
└─ Total: 15ms, Cost: $0.0000004
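
If the proxy uses the tracing crate, the standard tags could be attached roughly as follows; span names and fields mirror the trace above, but the exact integration is an assumption:

// Hypothetical sketch of attaching the standard span tags with the `tracing` crate.
use tracing::info_span;

fn store_with_tracing(namespace: &str, key: &str) {
    let span = info_span!(
        "ProxyService.Store",
        namespace,               // standard label on every span
        pattern = "keyvalue",
        operation = "store",
    );
    let _guard = span.enter();

    {
        let _resilience = info_span!(
            "ResilienceLayer.check",
            circuit_breaker_state = "closed",
            shed_probability = 0.0,
            queue_depth = 150_i64,
        ).entered();
        // ... circuit breaker / load shedding checks ...
    }

    {
        let _backend = info_span!(
            "RedisBackend.SET",
            backend_type = "redis",
            operation = "SET",
            key,
        ).entered();
        // ... backend call ...
    }
}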

Log Structure (Standard):

{
  "timestamp": "2025-11-14T10:30:00.123Z",
  "level": "info",
  "message": "KeyValue store operation succeeded",
  "namespace": "user-sessions",
  "pattern": "keyvalue",
  "operation": "store",
  "key": "session:123",
  "backend_type": "redis",
  "duration_ms": 1.2,
  "bytes_written": 1024,
  "ttl_seconds": 3600,
  "cost_usd": 0.0000004,
  "trace_id": "abc123",
  "span_id": "def456",
  "circuit_breaker_state": "closed",
  "queue_depth": 150
}

Implementation Roadmap

Phase 1: Platform Resilience (4 weeks)

Week 1-2: Circuit Breaker Layer

  • Implement per-backend circuit breakers in Rust proxy
  • Add configuration schema for circuit breaker policies
  • Integration tests with backend failures
  • Metrics: prism_circuit_breaker_state, prism_circuit_breaker_trips_total

Week 3-4: Load Shedding + Backpressure

  • Implement adaptive load shedding based on latency
  • Add queue depth monitoring and backpressure
  • Load testing to validate shed thresholds
  • Metrics: prism_requests_shed_total, prism_queue_depth

Deliverables:

  • Circuit breaker middleware in Rust proxy
  • Load shedding configuration per namespace
  • Grafana dashboard for resilience metrics

Phase 2: Data Lifecycle Management (4 weeks)

Week 1-2: TTL Policies

  • Add lifecycle configuration schema
  • Implement default TTL enforcement in patterns
  • Pattern-specific TTL override logic
  • Metrics: prism_keys_expired_total, prism_ttl_violations_total

Week 3-4: Automated Tiering

  • Background job for tiering (hot → warm → cold)
  • S3 integration for cold storage
  • Cost tracking per tier
  • Metrics: prism_storage_bytes{tier}, prism_tier_migrations_total

Deliverables:

  • Lifecycle policy configuration
  • Automated tiering job
  • Cost tracking dashboard

Phase 3: Multi-Modal Access (6 weeks)

Week 1-2: Multi-Modal Router

  • Design multi-modal namespace configuration
  • Implement pattern routing logic
  • Add secondary pattern support to proxy

Week 3-4: Materialized Views

  • Implement materialized view generation (Postgres)
  • Background refresh for views
  • Consistency guarantees

Week 5-6: Analytics Export

  • S3 Parquet export job
  • Scheduled export (daily/hourly)
  • Partition management

Deliverables:

  • Multi-modal namespace configuration
  • Materialized view support
  • Parquet export to S3

Phase 4: Capability-Aware Routing (4 weeks)

Week 1-2: Capability System

  • Define backend capability taxonomy
  • Implement capability detection for backends
  • Scoring function for backend selection

Week 3-4: Dynamic Routing

  • Implement operation-specific routing
  • Add routing rules configuration
  • Load testing for multi-backend namespaces

Deliverables:

  • Capability-aware router
  • Multi-backend namespace support
  • Routing rules configuration

Phase 5: Standardized Observability (3 weeks)

Week 1: Metrics Standardization

  • Migrate all patterns to standard metrics schema
  • Add cost metrics to patterns
  • Update Prometheus recording rules

Week 2: Tracing Enhancement

  • Add standard span tags across patterns
  • Implement cost tracking in spans
  • Integration with OpenTelemetry

Week 3: Dashboards + Alerts

  • Unified Grafana dashboards
  • Cost tracking dashboard
  • Standard alert rules for resilience

Deliverables:

  • Standard metrics across all patterns
  • Unified observability dashboards
  • Alert runbooks

Configuration Examples

Example 1: Production KeyValue with Full Resilience

namespaces:
  - name: user-sessions-prod
    pattern: keyvalue

    # Backends
    backends:
      - name: cache
        type: redis
        address: redis-cluster.prod:6379
        priority: 1

      - name: primary
        type: postgres
        address: postgres.prod:5432
        database: sessions
        priority: 2

    # Routing
    routing:
      read_operations:
        - try: cache
          on_miss: primary
      write_operations:
        - targets: [primary, cache]
          consistency: write_through

    # Resilience (NEW)
    resilience:
      circuit_breaker:
        enabled: true
        error_threshold: 0.5
        min_requests: 100
        timeout: 30s

      adaptive_load_shedding:
        enabled: true
        target_latency_p99: 100ms
        shed_probability: 0.5

      backpressure:
        enabled: true
        queue_depth_threshold: 1000

    # Lifecycle (NEW)
    lifecycle:
      default_ttl: 3600
      ttl_policies:
        - key_pattern: "session:*"
          ttl: 3600
        - key_pattern: "user:*:profile"
          ttl: 86400

      cleanup:
        enabled: true
        scan_interval: 3600

      cost_tracking:
        enabled: true
        storage_cost_per_gb_month: 0.023

      tiering:
        enabled: true
        hot_tier:
          backend: cache
          max_age: 86400
        warm_tier:
          backend: primary
          retention: 604800

Example 2: Multi-Modal TimeSeries + Counter

namespaces:
  - name: user-events-multi

    # Primary pattern (writes)
    primary_pattern: timeseries
    backend:
      type: postgres
      table: user_events

    # Secondary patterns (reads) - NEW
    secondary_patterns:
      # Counter aggregation view
      - pattern: counter
        mode: eventually_consistent_global
        aggregation:
          group_by: [user_id, event_type]
          window: 3600
          refresh_interval: 300

      # KeyValue lookup view
      - pattern: keyvalue
        key_template: "user:{user_id}:events"
        materialized_view: true
        refresh: continuous

      # Analytics export
      - pattern: analytics_export
        destination: s3
        bucket: prism-analytics
        format: parquet
        partition_by: [date, event_type]
        schedule: "0 2 * * *"

    # Lifecycle
    lifecycle:
      default_ttl: 2592000 # 30 days

      tiering:
        enabled: true
        hot_tier:
          backend: postgres
          max_age: 86400     # 1 day
        cold_tier:
          backend: s3
          retention: 2592000 # 30 days
          storage_class: GLACIER

Comparison to Netflix

Feature             | Netflix Data Gateway                  | Prism (Current)          | Prism (After RFC-054)
--------------------|---------------------------------------|--------------------------|----------------------
Platform Resilience | Circuit breakers, load shedding       | Per-pattern resilience   | ✅ Platform-level
Data Lifecycle      | TTL, Snapshot Janitors                | Manual cleanup           | ✅ Automated policies
Multi-Modal Access  | KV, TimeSeries, Counter on same data  | Single pattern/namespace | ✅ Secondary patterns
Cost Tracking       | Per-namespace attribution             | No tracking              | ✅ Built-in
Backend Selection   | Static configuration                  | Static configuration     | ✅ Capability-aware
Observability       | Standardized metrics                  | Pattern-specific         | ✅ Standardized

Success Criteria

  1. Resilience: 99.99% uptime across all namespaces (Netflix-level SLO)
  2. Cost Reduction: 30% storage cost reduction via automated tiering
  3. Developer Experience: Zero manual TTL management required
  4. Performance: P99 latency <100ms with load shedding
  5. Observability: Unified dashboards for all patterns

Open Questions

  1. Lifecycle Defaults: Should all namespaces have default TTL, or opt-in?
  2. Cost Attribution: Track at namespace or per-key granularity?
  3. Multi-Modal Consistency: Eventual or strong consistency between pattern views?
  4. Backend Failover: Automatic or manual failover when circuit breaker opens?
  5. Tiering Performance: Acceptable latency increase for warm/cold tier reads?

References

Netflix Documentation

External References

Revision History

  • 2025-11-14: Initial draft inspired by Netflix Data Gateway architecture