
RFC-051: Write-Ahead Log Pattern - Netflix Production Learnings

Overview

This RFC documents the Write-Ahead Log (WAL) pattern for Prism, grounded in Netflix's production experience running a distributed WAL platform at massive scale. Netflix's WAL handles billions of mutations daily, serving as a resilient buffer between applications and target datastores.

Key Insight from Netflix: Traditional database WALs are tied to a single database instance. Netflix abstracted this concept into a platform service that works across any datastore, providing durability, ordering, and reliable delivery as a reusable capability.

Production Context: Netflix's WAL at Scale

Real-World Metrics

Netflix's WAL operates at production scale with the following characteristics:

  • Write Latency: 1-2ms (Kafka append) vs 10-20ms (direct database writes)
  • Throughput: Billions of mutations per day across all namespaces
  • Durability: 3-way replication in Kafka survives 2 broker failures
  • Recovery: Crash recovery via checkpoint replay, typically <30 seconds
  • Availability: 99.99% uptime with automatic producer/consumer scaling

Problem Statement (Netflix's View)

Netflix identified several critical problems that WAL solves at scale:

  1. System Entropy Management: Multiple datastores drift out of sync due to partial failures
  2. Write Latency: Direct database writes block application threads (10-50ms)
  3. Data Corruption Recovery: Need point-in-time replay capability for incident recovery
  4. Traffic Spike Smoothing: Direct writes to databases cause cascading failures under load
  5. Generic Replication: Many datastores lack built-in replication capabilities

Architecture

High-Level Flow

┌──────────────────────────────────────────────┐
│                 Application                  │
│          (Write in 1-2ms, return)            │
└──────────────────────────────────────────────┘
                       │ write(key, value)
                       ▼
┌──────────────────────────────────────────────┐
│               Prism WAL Proxy                │
│                                              │
│  1. Append to WAL (Kafka/NATS)               │
│  2. Return success to client  ◄── Fast path! │
│  3. Async apply to database                  │
└──────────────────────────────────────────────┘
           │                        │
           │ (fast, durable)        │ (async, replayed)
           ▼                        ▼
      ┌──────────┐            ┌──────────┐
      │  Kafka   │            │ Postgres │
      │  (1ms)   │            │  (10ms)  │
      └──────────┘            └──────────┘
        3-way                  Eventually
        replication            consistent

Component Architecture

┌──────────────────────────────────────────────────────────┐
│                    Client Application                    │
└──────────────────────────────────────────────────────────┘
                             │ gRPC/HTTP
                             ▼
┌──────────────────────────────────────────────────────────┐
│                    Prism Proxy (Rust)                     │
│                                                          │
│  ┌────────────────────────────────────────────┐          │
│  │ WAL Write Path                             │          │
│  │  1. Validate session + auth                │          │
│  │  2. Generate idempotency key               │          │
│  │  3. Compute checksum (SHA256)              │          │
│  │  4. Append to Kafka topic                  │          │
│  │  5. Return wal_offset to client            │          │
│  └────────────────────────────────────────────┘          │
│                                                          │
│  ┌────────────────────────────────────────────┐          │
│  │ WAL Read Path                              │          │
│  │  1. Get consumer applied_offset            │          │
│  │  2. If data not applied, read from WAL     │          │
│  │  3. Else, read from database               │          │
│  │  4. Return data + source indicator         │          │
│  └────────────────────────────────────────────┘          │
└──────────────────────────────────────────────────────────┘
         │ Producer                       │ Consumer
         ▼                                ▼
┌─────────────────┐          ┌──────────────────────────┐
│  Kafka Cluster  │          │    WAL Consumer Pool     │
│                 │◄─────────│                          │
│  topic:         │   poll   │  - Poll Kafka topic      │
│    order-wal    │          │  - Validate checksums    │
│                 │          │  - Apply to database     │
│  retention:     │          │  - Update checkpoint     │
│    7 days       │          │  - Handle retries        │
│                 │          │  - Dead letter queue     │
│  replication: 3 │          │                          │
│  partitions: 16 │          │  Parallelism: 4 workers  │
└─────────────────┘          └──────────────────────────┘
                                          │ apply
                                          ▼
                                 ┌──────────────────┐
                                 │    PostgreSQL    │
                                 │                  │
                                 │  - Business data │
                                 │  - Idempotency   │
                                 │    tracking      │
                                 └──────────────────┘

                                          │ checkpoint (from consumer pool)
                                          ▼
                   ┌─────────────────────────────────────────────┐
                   │     Checkpoint Store (SQLite/Postgres)      │
                   │                                             │
                   │  - consumer_id                              │
                   │  - topic                                    │
                   │  - partition                                │
                   │  - last_applied_offset                      │
                   │  - timestamp                                │
                   └─────────────────────────────────────────────┘

Netflix Production Use Cases

Netflix documented several production use cases where WAL proved critical:

1. System Entropy Management

Problem: Application writes to Cassandra (primary) and EVCache (secondary). Network partition causes EVCache writes to fail. System now has inconsistent state.

WAL Solution:

  • Application writes to WAL once (1ms)
  • WAL consumer applies to both Cassandra and EVCache
  • Automatic retries with exponential backoff
  • If EVCache fails, consumer retries until success
  • Both datastores eventually consistent, guaranteed

Netflix Metric: Reduced data inconsistencies by 95% in multi-datastore applications.
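
A minimal sketch of the fan-out applier described above, assuming a single consumer applies each WAL entry to every target store and retries only the stores that failed; the applier callables and entry shape are illustrative, not Prism APIs:

def apply_to_all(entry, appliers, retry_until_success):
    """Fan one WAL entry out to every target store; retry only the failures."""
    failed = []
    for name, apply in appliers.items():
        try:
            apply(entry)
        except Exception:
            failed.append(name)
    for name in failed:
        # Only the store that failed (e.g. EVCache) is retried; the store that
        # succeeded (e.g. Cassandra) is not re-applied.
        retry_until_success(appliers[name], entry)

# Example wiring with stand-in appliers (hypothetical names):
# apply_to_all(entry, {"cassandra": write_cassandra, "evcache": write_evcache}, backoff_retry)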

2. Data Corruption Recovery

Problem: Database corruption due to bad application logic or storage failure. Need to restore from backup and replay changes.

WAL Solution:

  • Restore database from backup (e.g., 6 hours old)
  • Replay WAL mutations from 6 hours ago to present
  • Option to omit specific faulty mutations during replay
  • System back to consistent state

Netflix Metric: Reduced recovery time from hours to minutes for corruption incidents.

3. Traffic Spike Smoothing

Problem: Traffic spike causes database connection pool exhaustion. Direct writes start failing, cascading failures across microservices.

WAL Solution:

  • Writes go to Kafka (scales horizontally)
  • Consumer pool applies to database at sustainable rate
  • Queue builds up during spike, drains after
  • Application never sees database connection errors

Netflix Metric: Survived 10x traffic spikes without database saturation or application errors.

4. Generic Data Replication

Problem: Need to replicate data from us-east-1 to eu-west-1 for compliance. DynamoDB Global Tables too expensive.

WAL Solution:

  • WAL consumers in both regions
  • us-east-1 consumer applies to local DynamoDB
  • Kafka MirrorMaker replicates to eu-west-1
  • eu-west-1 consumer applies to local DynamoDB
  • Cross-region replication with 200-500ms lag

Netflix Metric: Reduced cross-region replication costs by 70% vs managed services.

5. Asynchronous Processing with Delayed Queues

Problem: Bulk delete operation needs to delete 1 million records. Doing it synchronously blocks application thread for minutes.

WAL Solution:

  • Write 1M delete operations to WAL with delay=60s, jitter=30s
  • Application returns immediately
  • Consumer spreads the deletes across a 60-90 second window (60s base delay plus up to 30s of jitter)
  • Database load spread evenly, no saturation

Netflix Metric: Enabled bulk operations without impacting real-time latency.
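
A small sketch of how the delay and jitter could be computed, assuming each WAL entry carries a scheduled apply time that the consumer checks before applying; the function names are illustrative:

import random
import time
from typing import Optional

def scheduled_apply_at(enqueue_ts: float, delay_s: float = 60.0, jitter_s: float = 30.0) -> float:
    """Spread bulk mutations across the [delay, delay + jitter] window after enqueue."""
    return enqueue_ts + delay_s + random.uniform(0.0, jitter_s)

def is_ready(apply_at: float, now: Optional[float] = None) -> bool:
    """Consumer-side check: only apply entries whose scheduled time has passed."""
    return (now if now is not None else time.time()) >= apply_at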

Configuration

Prism WAL Namespace Configuration

namespaces:
  - name: order-mutations
    pattern: write-ahead-log
    backend: postgres            # Target database

    wal:
      # CRITICAL: Must use durable backend
      backend: kafka             # or nats-jetstream
      topic: order-wal

      # Retention (match to recovery SLA)
      retention: 604800          # 7 days

      # Durability settings (Netflix: 3-way replication)
      replication: 3
      min_in_sync_replicas: 2
      acks: all                  # Wait for all replicas before ack

      # Partitioning (scale writes)
      partitions: 16
      partition_key: "{key}"     # Partition by order key

      # Consumer configuration
      consumers:
        - name: db-applier
          type: database_writer
          backend: postgres
          parallelism: 4         # 4 workers per partition

          # Batching (reduce DB round-trips)
          batch_size: 1000
          batch_timeout: 100ms

          # Retry policy (Netflix: exponential backoff)
          retry:
            max_attempts: 10
            base_delay: 100ms
            max_delay: 60s
            multiplier: 2

          # Dead letter queue (after max retries)
          dlq:
            topic: order-wal-dlq
            alert_on_message: true

      # Checkpoint configuration
      checkpoint:
        backend: postgres        # or sqlite for local dev
        table: wal_checkpoints
        interval: 5s             # Save checkpoint every 5s

      # Read consistency mode
      consistency:
        read_mode: wal_first     # Check WAL before DB

        # Read-your-writes guarantee:
        # if client's last write offset > DB applied offset,
        # read from WAL instead of DB
        max_lag_tolerance: 1000ms  # Alert if lag exceeds 1s

      # Durability guarantees
      durability:
        fsync: true              # Kafka fsync before ack
        compression: snappy      # Reduce storage, maintain perf

      # Monitoring
      metrics:
        - wal_write_latency_p99
        - wal_lag_seconds
        - unapplied_entries_count
        - consumer_throughput
        - checkpoint_age_seconds
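
For illustration, the read_mode: wal_first behaviour reduces to an offset comparison. This sketch assumes the proxy tracks the client's last write offset and the consumer's applied offset; the names are illustrative, not the Prism API:

def wal_first_read(key, client_last_offset, applied_offset, read_wal, read_db):
    """Serve from the WAL tail when the client's last write is not yet applied."""
    if client_last_offset is not None and client_last_offset > applied_offset:
        return read_wal(key), "wal"       # read-your-writes: data only in the WAL so far
    return read_db(key), "database"       # already applied: normal database read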

Multi-Datastore Configuration (Netflix Pattern)

Netflix's WAL commonly writes to multiple datastores simultaneously:

namespaces:
  - name: user-profile-mutations
    pattern: write-ahead-log

    wal:
      backend: kafka
      topic: user-profile-wal
      retention: 604800
      replication: 3

      # Multiple consumers for multiple datastores
      consumers:
        - name: cassandra-applier
          type: database_writer
          backend: cassandra
          keyspace: user_profiles
          parallelism: 8

        - name: elasticsearch-applier
          type: search_indexer
          backend: elasticsearch
          index: user_profiles
          parallelism: 4

        - name: redis-applier
          type: cache_writer
          backend: redis
          ttl: 3600
          parallelism: 2

      # All consumers must apply before data considered "applied"
      consistency:
        read_mode: wal_first
        require_all_consumers: true  # Wait for all 3 consumers

Client API

Rust Client

use prism_client::{Client, WalWriteOptions};

#[tokio::main]
async fn main() -> Result<()> {
    let client = Client::connect("http://localhost:8980").await?;

    // Write to WAL (fast path: 1-2ms)
    let result = client
        .wal_write("order-mutations", "order:12345", order_data)
        .await?;

    println!("Write complete:");
    println!("  WAL offset: {}", result.wal_offset);
    println!("  Partition: {}", result.partition);
    println!("  Latency: {}ms", result.latency_ms);
    // Output: Latency: 1.2ms (Netflix: p99 < 2ms)

    // Read with read-your-writes guarantee
    let order = client
        .wal_read("order-mutations", "order:12345")
        .await?;

    println!("Read complete:");
    println!("  Source: {}", order.source); // "wal" or "database"
    println!("  Data: {:?}", order.data);

    // Check WAL lag
    let health = client.wal_health("order-mutations").await?;
    println!("WAL Health:");
    println!("  Latest offset: {}", health.latest_offset);
    println!("  Applied offset: {}", health.applied_offset);
    println!("  Lag: {} entries ({:.2}s)",
        health.lag_entries,
        health.lag_seconds);

    Ok(())
}

Go Client

package main

import (
	"context"
	"fmt"

	prism "github.com/prism/client-go"
)

func main() {
	client := prism.NewClient("localhost:8980")
	defer client.Close()

	ctx := context.Background()

	// Write to WAL
	result, err := client.WalWrite(ctx, &prism.WalWriteRequest{
		Namespace: "order-mutations",
		Key:       "order:12345",
		Data:      orderData,
	})
	if err != nil {
		panic(err)
	}

	fmt.Printf("WAL offset: %d, latency: %dms\n",
		result.WalOffset, result.LatencyMs)

	// Read with read-your-writes
	order, err := client.WalRead(ctx, &prism.WalReadRequest{
		Namespace: "order-mutations",
		Key:       "order:12345",
	})
	if err != nil {
		panic(err)
	}

	fmt.Printf("Source: %s, Data: %v\n", order.Source, order.Data)
}

Python Client

from prism_client import Client

async def main():
    async with Client("http://localhost:8980") as client:
        # Write to WAL (fast: 1-2ms)
        result = await client.wal_write(
            namespace="order-mutations",
            key="order:12345",
            data=order_data
        )

        print(f"WAL offset: {result.wal_offset}")
        print(f"Latency: {result.latency_ms}ms")

        # Read with read-your-writes guarantee
        order = await client.wal_read(
            namespace="order-mutations",
            key="order:12345"
        )

        print(f"Source: {order.source}")  # "wal" or "database"
        print(f"Data: {order.data}")

if __name__ == "__main__":
    import asyncio
    asyncio.run(main())

Implementation Details

WAL Message Schema

message WalEntry {
  // Identity
  string client_id = 1;
  string namespace = 2;
  string key = 3;

  // Operation
  enum Operation {
    INSERT = 0;
    UPDATE = 1;
    DELETE = 2;
    UPSERT = 3;
  }
  Operation operation = 4;

  // Payload
  bytes data = 5;

  // Integrity
  string checksum = 6;         // SHA256 of data
  string idempotency_key = 7;  // UUID for deduplication

  // Metadata
  int64 timestamp_ms = 8;
  int32 partition = 9;
  int64 offset = 10;

  // Ordering (for complex mutations)
  int32 sequence_number = 11;
  bool is_completion_marker = 12;
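
As a sketch of the producer side, the integrity fields can be filled in as follows; a dict stands in for the protobuf message, and the helper is illustrative rather than the Prism producer API:

import hashlib
import time
import uuid

def build_wal_entry(namespace: str, key: str, data: bytes, operation: str = "UPSERT") -> dict:
    """Populate the integrity and metadata fields of a WalEntry before appending."""
    return {
        "namespace": namespace,
        "key": key,
        "operation": operation,
        "data": data,
        "checksum": hashlib.sha256(data).hexdigest(),  # verified by the consumer
        "idempotency_key": str(uuid.uuid4()),          # deduplication on replay
        "timestamp_ms": int(time.time() * 1000),
    }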

Checkpoint Schema

CREATE TABLE wal_checkpoints (
    consumer_id         TEXT NOT NULL,
    topic               TEXT NOT NULL,
    partition           INTEGER NOT NULL,
    last_applied_offset BIGINT NOT NULL,
    timestamp           BIGINT NOT NULL,
    message_count       BIGINT NOT NULL,
    PRIMARY KEY (consumer_id, topic, partition)
);

CREATE INDEX idx_checkpoint_timestamp
    ON wal_checkpoints (timestamp DESC);

Idempotency Tracking

CREATE TABLE wal_idempotency (
    idempotency_key TEXT PRIMARY KEY,
    applied_at      BIGINT NOT NULL,
    wal_offset      BIGINT NOT NULL
);

-- TTL: Keep for 7 days (match WAL retention)
CREATE INDEX idx_idempotency_applied_at
    ON wal_idempotency (applied_at);

-- Cleanup old entries
DELETE FROM wal_idempotency
WHERE applied_at < (EXTRACT(EPOCH FROM NOW()) * 1000 - 604800000);
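
A minimal consumer-side sketch of the idempotency check against this table, using SQLite for brevity; in production the existence check, the mutation, and the idempotency insert should share one transaction, and the entry shape and apply_fn are illustrative:

import sqlite3
import time

def apply_once(db: sqlite3.Connection, entry: dict, apply_fn) -> bool:
    """Apply a WAL entry at most once, keyed on its idempotency_key."""
    seen = db.execute(
        "SELECT 1 FROM wal_idempotency WHERE idempotency_key = ?",
        (entry["idempotency_key"],),
    ).fetchone()
    if seen is not None:
        return False                      # already applied on a previous run; skip
    apply_fn(entry)                       # the actual database mutation
    db.execute(
        "INSERT INTO wal_idempotency (idempotency_key, applied_at, wal_offset) VALUES (?, ?, ?)",
        (entry["idempotency_key"], int(time.time() * 1000), entry["offset"]),
    )
    db.commit()
    return True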

Crash Recovery Flow

Consumer Crash Recovery

1. Consumer crashes after applying offset 505
   - Last checkpoint was saved at offset 500 (checkpoints are periodic)

2. Consumer restarts
   - Load checkpoint: last_applied_offset = 500

3. Resume from checkpoint
   - Poll Kafka from offset 500
   - Kafka returns messages [500..550]

4. Check idempotency
   - Query idempotency table for each message
   - Skip already-applied messages (500-505 already applied)
   - Apply new messages (506-550)

5. Update checkpoint
   - Save checkpoint: last_applied_offset = 550

6. Normal operation resumed
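
The recovery sequence above reduces to roughly the following consumer loop, assuming a checkpoint store, a Kafka-like poll function, and the idempotency check from the previous section; all names are illustrative:

def resume_consumer(consumer_id, topic, partition, checkpoints, poll, already_applied, apply):
    """Resume from the last saved checkpoint; idempotency covers the overlap window."""
    start = checkpoints.get((consumer_id, topic, partition), 0)      # e.g. offset 500
    for entry in poll(topic, partition, start_offset=start):         # e.g. offsets 500..550
        if already_applied(entry["idempotency_key"]):
            continue                                                 # 500-505: skip
        apply(entry)                                                 # 506-550: apply
        # Checkpoint after apply (in practice batched on the configured interval).
        checkpoints[(consumer_id, topic, partition)] = entry["offset"]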

Database Corruption Recovery

1. Detect corruption
   - Application reports incorrect data
   - Investigation finds bad mutation at offset 42,500

2. Stop consumers
   - Prevent further writes

3. Restore from backup
   - Restore database to state at offset 40,000

4. Selective replay
   - Replay WAL from offset 40,000
   - Skip offset 42,500 (faulty mutation)
   - Apply all other mutations up to latest offset

5. Resume normal operation
   - Database consistent
   - Consumers resume from latest offset
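
Selective replay is the same consumer loop with an explicit skip list, sketched here with illustrative names:

def selective_replay(poll, apply, start_offset: int, end_offset: int, skip_offsets: set):
    """Replay WAL entries in order, omitting known-bad offsets (e.g. {42_500})."""
    for entry in poll(start_offset=start_offset, end_offset=end_offset):
        if entry["offset"] in skip_offsets:
            continue                      # the faulty mutation is never re-applied
        apply(entry)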

Monitoring and Alerting

Key Metrics (Netflix-Informed)

# Write latency (target: p99 < 2ms)
prism_wal_write_latency_seconds{
  namespace="order-mutations",
  quantile="0.99"
} < 0.002

# WAL lag (target: < 1 second)
prism_wal_lag_seconds{
  namespace="order-mutations"
} < 1.0

# Unapplied entries (target: < 1000)
prism_wal_unapplied_entries{
  namespace="order-mutations"
} < 1000

# Consumer throughput
prism_wal_consumer_throughput{
  namespace="order-mutations",
  consumer="db-applier"
} > 1000  # msgs/sec

# Checkpoint freshness (target: < 30 seconds)
prism_wal_checkpoint_age_seconds{
  namespace="order-mutations",
  consumer="db-applier"
} < 30

# Dead letter queue depth (alert if > 0)
prism_wal_dlq_depth{
  namespace="order-mutations"
} == 0

Alert Rules

alerts:
  - name: HighWALLag
    condition: prism_wal_lag_seconds > 10
    severity: warning
    message: "WAL lag exceeds 10 seconds"
    runbook: https://docs.prism.io/runbooks/wal-lag

  - name: CriticalWALLag
    condition: prism_wal_lag_seconds > 60
    severity: critical
    message: "WAL lag exceeds 60 seconds - consumer may be stuck"
    action: page_oncall

  - name: DLQMessagesPresent
    condition: prism_wal_dlq_depth > 0
    severity: warning
    message: "Messages in dead letter queue require manual intervention"
    action: create_ticket

  - name: CheckpointStale
    condition: prism_wal_checkpoint_age_seconds > 300
    severity: critical
    message: "Checkpoint not updated in 5 minutes - consumer may be dead"
    action: page_oncall

Netflix Production Lessons Learned

1. Automatic Scaling is Essential

Netflix: WAL consumers must scale independently of producers. During traffic spikes, producers scale up to handle writes, while consumers maintain steady throughput.

Prism Implementation:

  • Kubernetes HPA based on prism_wal_lag_seconds metric
  • Scale consumers up when lag > 10 seconds
  • Scale down when lag < 1 second for 5 minutes
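
The scaling decision itself is simple; this sketch shows the policy the autoscaler would encode, using the thresholds above (the function is illustrative, not how a Kubernetes HPA is written):

def desired_consumer_count(current: int, lag_seconds: float, minutes_below_target: float,
                           min_workers: int = 1, max_workers: int = 16) -> int:
    """Scale up on sustained lag; scale down only after a quiet period."""
    if lag_seconds > 10:
        return min(current * 2, max_workers)          # lag > 10s: add consumers
    if lag_seconds < 1 and minutes_below_target >= 5:
        return max(current - 1, min_workers)          # calm for 5 minutes: shrink
    return current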

2. Adaptive Load Shedding

Netflix: When database is under load, back-pressure can propagate to WAL consumers. Adaptive load shedding prevents cascading failures.

Prism Implementation:

  • Monitor database connection pool utilization
  • If pool > 90% utilized, reduce consumer batch size
  • If pool > 95% utilized, add delay between batches
  • Circuit breaker after 3 consecutive batch failures
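
A sketch of the shedding policy, assuming the consumer can observe connection-pool utilization between batches; the thresholds follow the list above, and the circuit breaker is omitted for brevity:

def tune_batch(pool_utilization: float, current_batch: int, base_batch: int = 1000):
    """Return (batch_size, inter_batch_delay_s) based on database pressure."""
    if pool_utilization > 0.95:
        return max(current_batch // 2, 1), 0.5        # shrink batches and pause between them
    if pool_utilization > 0.90:
        return max(current_batch // 2, 1), 0.0        # shrink batches only
    return base_batch, 0.0                            # healthy: restore configured batch size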

3. Dedicated Dead Letter Queues

Netflix: Some mutations will never succeed (e.g., bad data format, constraint violation). Don't retry forever.

Prism Implementation:

  • After 10 retry attempts, move to DLQ
  • DLQ messages trigger alert for manual review
  • Operators can replay from DLQ after fixing root cause
  • Option to skip DLQ messages during replay
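
Combined with the retry policy from the namespace configuration, the consumer's per-entry handling could look roughly like this; apply and send_to_dlq are illustrative stand-ins:

import time

def apply_with_retry(entry, apply, send_to_dlq, max_attempts=10,
                     base_delay=0.1, max_delay=60.0, multiplier=2.0):
    """Retry with exponential backoff, then park the entry on the DLQ and alert."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            apply(entry)
            return True
        except Exception as err:
            if attempt == max_attempts:
                send_to_dlq(entry, reason=str(err))   # operators replay after fixing root cause
                return False
            time.sleep(delay)
            delay = min(delay * multiplier, max_delay)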

4. Namespace-Driven Configuration

Netflix: Different use cases need different WAL configurations. One size does not fit all.

Prism Implementation:

  • Per-namespace retention policies (1 day to 30 days)
  • Per-namespace replication factors (1x to 3x)
  • Per-namespace consumer parallelism (1 to 16 workers)
  • Per-namespace retry policies

5. Observability is Non-Negotiable

Netflix: WAL is infrastructure. Operators must see lag, throughput, errors in real-time.

Prism Implementation:

  • Rich metrics (write latency, lag, throughput, DLQ depth)
  • Distributed tracing (span from client write to DB apply)
  • Structured logging with trace IDs
  • Grafana dashboards per namespace

Comparison to Alternatives

WAL vs Direct Database Writes

Aspect                  WAL Pattern                     Direct Writes
----------------------  ------------------------------  ----------------------
Write Latency           1-2ms (Kafka)                   10-50ms (database)
Durability              Kafka 3-way replication         Database replication
Crash Recovery          Replay from checkpoint          Database recovery only
Traffic Spike Handling  Queue buffers spike             Database saturates
Multi-Datastore         Single write, multiple applies  Dual write complexity
Complexity              Medium (adds Kafka)             Low (just database)

When to use WAL:

  • ✅ High write throughput (1000s of writes/sec)
  • ✅ Need sub-5ms write latency
  • ✅ Writing to multiple datastores
  • ✅ Need point-in-time recovery
  • ✅ Traffic spikes are common

When to use direct writes:

  • ✅ Low write volume (<100 writes/sec)
  • ✅ Single datastore
  • ✅ Simplicity is paramount
  • ✅ Strong consistency required (not eventual)

WAL vs Event Sourcing

Aspect                WAL                     Event Sourcing
--------------------  ----------------------  ------------------
Purpose               Fast, durable writes    Full audit trail
Retention             7-30 days               Forever
State Reconstruction  From database           From event log
Complexity            Medium                  High
Read Model            Database                Materialized views

Use WAL when: You need fast writes and durability, not a full audit trail.
Use Event Sourcing when: You need time-travel queries and complete history.

WAL vs Outbox Pattern

Aspect        WAL                         Outbox
------------  --------------------------  -----------------------------
Write Path    Direct to Kafka             Database transaction + outbox
Consistency   Eventual                    Transactional
Latency       1-2ms                       10-20ms (database)
Use Case      High-throughput writes      Transactional messaging

Use WAL when: Write throughput is critical and eventual consistency is acceptable.
Use Outbox when: You need ACID guarantees for event publishing.

Implementation Roadmap

Phase 1: Core WAL Pattern (4 weeks)

  • ✅ Kafka producer integration
  • ✅ WAL write API (gRPC/HTTP)
  • ✅ Basic consumer with database writer
  • ✅ Checkpoint management (SQLite)
  • ✅ Idempotency tracking
  • ✅ Basic monitoring (lag, throughput)

Phase 2: Production Readiness (3 weeks)

  • ✅ Dead letter queue
  • ✅ Retry with exponential backoff
  • ✅ Read-your-writes consistency
  • ✅ Crash recovery testing
  • ✅ Multi-consumer configuration
  • ✅ Grafana dashboards

Phase 3: Netflix-Inspired Features (4 weeks)

  • ✅ Adaptive load shedding
  • ✅ Automatic consumer scaling
  • ✅ Cross-region replication (MirrorMaker)
  • ✅ Selective replay (skip faulty mutations)
  • ✅ Delayed queue support
  • ✅ Performance tuning (batching, compression)

References

Netflix Production Experience

Prism Documentation

Kafka & Durability

Revision History

  • 2025-11-07: Initial draft grounded in Netflix production experience and learnings
    • Documented Netflix's real-world use cases and metrics
    • Included production lessons learned (scaling, load shedding, DLQ)
    • Added comprehensive monitoring and alerting guidance
    • Provided implementation roadmap with Netflix-inspired features