
RFC-051: Write-Ahead Log Pattern - Netflix Production Learnings

Overview

This RFC documents the Write-Ahead Log (WAL) pattern for Prism, grounded in Netflix's production experience running a distributed WAL platform at massive scale. Netflix's WAL handles billions of mutations daily, serving as a resilient buffer between applications and target datastores.

Key Insight from Netflix: Traditional database WALs are tied to a single database instance. Netflix abstracted this concept into a platform service that works across any datastore, providing durability, ordering, and reliable delivery as a reusable capability.

Production Context: Netflix's WAL at Scale

Real-World Metrics

Netflix's WAL operates at production scale with the following characteristics:

  • Write Latency: 1-2ms (Kafka append) vs 10-20ms (direct database writes)
  • Throughput: Billions of mutations per day across all namespaces
  • Durability: 3-way replication in Kafka survives 2 broker failures
  • Recovery: Crash recovery via checkpoint replay, typically <30 seconds
  • Availability: 99.99% uptime with automatic producer/consumer scaling

Problem Statement (Netflix's View)

Netflix identified several critical problems that WAL solves at scale:

  1. System Entropy Management: Multiple datastores drift out of sync due to partial failures
  2. Write Latency: Direct database writes block application threads (10-50ms)
  3. Data Corruption Recovery: Need point-in-time replay capability for incident recovery
  4. Traffic Spike Smoothing: Direct writes to databases cause cascading failures under load
  5. Generic Replication: Many datastores lack built-in replication capabilities

Architecture

High-Level Flow

┌──────────────────────────────────────────────┐
│                 Application                  │
│          (Write in 1-2ms, return)            │
└──────────────────────────────────────────────┘
                       │ write(key, value)
                       ▼
┌──────────────────────────────────────────────┐
│               Prism WAL Proxy                │
│                                              │
│  1. Append to WAL (Kafka/NATS)               │
│  2. Return success to client  ◄── Fast path! │
│  3. Async apply to database                  │
└──────────────────────────────────────────────┘
           │                        │
           │ (fast, durable)        │ (async, replayed)
           ▼                        ▼
      ┌──────────┐            ┌──────────┐
      │  Kafka   │            │ Postgres │
      │  (1ms)   │            │  (10ms)  │
      └──────────┘            └──────────┘
        3-way                  Eventually
        replication            consistent

Component Architecture

┌──────────────────────────────────────────────────────────┐
│                    Client Application                    │
└──────────────────────────────────────────────────────────┘
                             │ gRPC/HTTP
                             ▼
┌──────────────────────────────────────────────────────────┐
│                    Prism Proxy (Rust)                     │
│                                                          │
│  ┌────────────────────────────────────────────┐          │
│  │ WAL Write Path                             │          │
│  │  1. Validate session + auth                │          │
│  │  2. Generate idempotency key               │          │
│  │  3. Compute checksum (SHA256)              │          │
│  │  4. Append to Kafka topic                  │          │
│  │  5. Return wal_offset to client            │          │
│  └────────────────────────────────────────────┘          │
│                                                          │
│  ┌────────────────────────────────────────────┐          │
│  │ WAL Read Path                              │          │
│  │  1. Get consumer applied_offset            │          │
│  │  2. If data not applied, read from WAL     │          │
│  │  3. Else, read from database               │          │
│  │  4. Return data + source indicator         │          │
│  └────────────────────────────────────────────┘          │
└──────────────────────────────────────────────────────────┘
         │ Producer                       │ Consumer
         ▼                                ▼
┌─────────────────┐          ┌──────────────────────────┐
│  Kafka Cluster  │          │    WAL Consumer Pool     │
│                 │◄─────────│                          │
│  topic:         │   poll   │  - Poll Kafka topic      │
│    order-wal    │          │  - Validate checksums    │
│                 │          │  - Apply to database     │
│  retention:     │          │  - Update checkpoint     │
│    7 days       │          │  - Handle retries        │
│                 │          │  - Dead letter queue     │
│  replication: 3 │          │                          │
│  partitions: 16 │          │  Parallelism: 4 workers  │
└─────────────────┘          └──────────────────────────┘
                                          │ apply
                                          ▼
                                 ┌──────────────────┐
                                 │    PostgreSQL    │
                                 │                  │
                                 │  - Business data │
                                 │  - Idempotency   │
                                 │    tracking      │
                                 └──────────────────┘

                                          │ checkpoint (from consumer pool)
                                          ▼
                   ┌─────────────────────────────────────────────┐
                   │     Checkpoint Store (SQLite/Postgres)      │
                   │                                             │
                   │  - consumer_id                              │
                   │  - topic                                    │
                   │  - partition                                │
                   │  - last_applied_offset                      │
                   │  - timestamp                                │
                   └─────────────────────────────────────────────┘

Netflix Production Use Cases

Netflix documented several production use cases where WAL proved critical:

1. System Entropy Management

Problem: Application writes to Cassandra (primary) and EVCache (secondary). Network partition causes EVCache writes to fail. System now has inconsistent state.

WAL Solution:

  • Application writes to WAL once (1ms)
  • WAL consumer applies to both Cassandra and EVCache
  • Automatic retries with exponential backoff
  • If EVCache fails, consumer retries until success
  • Both datastores eventually consistent, guaranteed

Netflix Metric: Reduced data inconsistencies by 95% in multi-datastore applications.
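
A minimal sketch of the fan-out applier described above, assuming a single consumer applies each WAL entry to every target store and retries only the stores that failed; the applier callables and entry shape are illustrative, not Prism APIs:

def apply_to_all(entry, appliers, retry_until_success):
    """Fan one WAL entry out to every target store; retry only the failures."""
    failed = []
    for name, apply in appliers.items():
        try:
            apply(entry)
        except Exception:
            failed.append(name)
    for name in failed:
        # Only the store that failed (e.g. EVCache) is retried; the store that
        # succeeded (e.g. Cassandra) is not re-applied.
        retry_until_success(appliers[name], entry)

# Example wiring with stand-in appliers (hypothetical names):
# apply_to_all(entry, {"cassandra": write_cassandra, "evcache": write_evcache}, backoff_retry)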

2. Data Corruption Recovery

Problem: Database corruption due to bad application logic or storage failure. Need to restore from backup and replay changes.

WAL Solution:

  • Restore database from backup (e.g., 6 hours old)
  • Replay WAL mutations from 6 hours ago to present
  • Option to omit specific faulty mutations during replay
  • System back to consistent state

Netflix Metric: Reduced recovery time from hours to minutes for corruption incidents.

3. Traffic Spike Smoothing

Problem: Traffic spike causes database connection pool exhaustion. Direct writes start failing, cascading failures across microservices.

WAL Solution:

  • Writes go to Kafka (scales horizontally)
  • Consumer pool applies to database at sustainable rate
  • Queue builds up during spike, drains after
  • Application never sees database connection errors

Netflix Metric: Survived 10x traffic spikes without database saturation or application errors.

4. Generic Data Replication

Problem: Need to replicate data from us-east-1 to eu-west-1 for compliance. DynamoDB Global Tables too expensive.

WAL Solution:

  • WAL consumers in both regions
  • us-east-1 consumer applies to local DynamoDB
  • Kafka MirrorMaker replicates to eu-west-1
  • eu-west-1 consumer applies to local DynamoDB
  • Cross-region replication with 200-500ms lag

Netflix Metric: Reduced cross-region replication costs by 70% vs managed services.

5. Asynchronous Processing with Delayed Queues

Problem: Bulk delete operation needs to delete 1 million records. Doing it synchronously blocks application thread for minutes.

WAL Solution:

  • Write 1M delete operations to WAL with delay=60s, jitter=30s
  • Application returns immediately
  • Consumer spreads the deletes across a 60-90 second window (60s base delay plus up to 30s of jitter)
  • Database load spread evenly, no saturation

Netflix Metric: Enabled bulk operations without impacting real-time latency.
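
A small sketch of how the delay and jitter could be computed, assuming each WAL entry carries a scheduled apply time that the consumer checks before applying; the function names are illustrative:

import random
import time
from typing import Optional

def scheduled_apply_at(enqueue_ts: float, delay_s: float = 60.0, jitter_s: float = 30.0) -> float:
    """Spread bulk mutations across the [delay, delay + jitter] window after enqueue."""
    return enqueue_ts + delay_s + random.uniform(0.0, jitter_s)

def is_ready(apply_at: float, now: Optional[float] = None) -> bool:
    """Consumer-side check: only apply entries whose scheduled time has passed."""
    return (now if now is not None else time.time()) >= apply_at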

Configuration

Prism WAL Namespace Configuration

namespaces:
  - name: order-mutations
    pattern: write-ahead-log
    backend: postgres            # Target database

    wal:
      # CRITICAL: Must use durable backend
      backend: kafka             # or nats-jetstream
      topic: order-wal

      # Retention (match to recovery SLA)
      retention: 604800          # 7 days

      # Durability settings (Netflix: 3-way replication)
      replication: 3
      min_in_sync_replicas: 2
      acks: all                  # Wait for all replicas before ack

      # Partitioning (scale writes)
      partitions: 16
      partition_key: "{key}"     # Partition by order key

      # Consumer configuration
      consumers:
        - name: db-applier
          type: database_writer
          backend: postgres
          parallelism: 4         # 4 workers per partition

          # Batching (reduce DB round-trips)
          batch_size: 1000
          batch_timeout: 100ms

          # Retry policy (Netflix: exponential backoff)
          retry:
            max_attempts: 10
            base_delay: 100ms
            max_delay: 60s
            multiplier: 2

          # Dead letter queue (after max retries)
          dlq:
            topic: order-wal-dlq
            alert_on_message: true

      # Checkpoint configuration
      checkpoint:
        backend: postgres        # or sqlite for local dev
        table: wal_checkpoints
        interval: 5s             # Save checkpoint every 5s

      # Read consistency mode
      consistency:
        read_mode: wal_first     # Check WAL before DB

        # Read-your-writes guarantee:
        # if client's last write offset > DB applied offset,
        # read from WAL instead of DB
        max_lag_tolerance: 1000ms  # Alert if lag exceeds 1s

      # Durability guarantees
      durability:
        fsync: true              # Kafka fsync before ack
        compression: snappy      # Reduce storage, maintain perf

      # Monitoring
      metrics:
        - wal_write_latency_p99
        - wal_lag_seconds
        - unapplied_entries_count
        - consumer_throughput
        - checkpoint_age_seconds
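
For illustration, the read_mode: wal_first behaviour reduces to an offset comparison. This sketch assumes the proxy tracks the client's last write offset and the consumer's applied offset; the names are illustrative, not the Prism API:

def wal_first_read(key, client_last_offset, applied_offset, read_wal, read_db):
    """Serve from the WAL tail when the client's last write is not yet applied."""
    if client_last_offset is not None and client_last_offset > applied_offset:
        return read_wal(key), "wal"       # read-your-writes: data only in the WAL so far
    return read_db(key), "database"       # already applied: normal database read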

Multi-Datastore Configuration (Netflix Pattern)

Netflix's WAL commonly writes to multiple datastores simultaneously:

namespaces:
  - name: user-profile-mutations
    pattern: write-ahead-log

    wal:
      backend: kafka
      topic: user-profile-wal
      retention: 604800
      replication: 3

      # Multiple consumers for multiple datastores
      consumers:
        - name: cassandra-applier
          type: database_writer
          backend: cassandra
          keyspace: user_profiles
          parallelism: 8

        - name: elasticsearch-applier
          type: search_indexer
          backend: elasticsearch
          index: user_profiles
          parallelism: 4

        - name: redis-applier
          type: cache_writer
          backend: redis
          ttl: 3600
          parallelism: 2

      # All consumers must apply before data considered "applied"
      consistency:
        read_mode: wal_first
        require_all_consumers: true  # Wait for all 3 consumers

Client API

Rust Client

use prism_client::{Client, WalWriteOptions};

#[tokio::main]
async fn main() -> Result<()> {
    let client = Client::connect("http://localhost:8980").await?;

    // Write to WAL (fast path: 1-2ms)
    let result = client
        .wal_write("order-mutations", "order:12345", order_data)
        .await?;

    println!("Write complete:");
    println!("  WAL offset: {}", result.wal_offset);
    println!("  Partition: {}", result.partition);
    println!("  Latency: {}ms", result.latency_ms);
    // Output: Latency: 1.2ms (Netflix: p99 < 2ms)

    // Read with read-your-writes guarantee
    let order = client
        .wal_read("order-mutations", "order:12345")
        .await?;

    println!("Read complete:");
    println!("  Source: {}", order.source); // "wal" or "database"
    println!("  Data: {:?}", order.data);

    // Check WAL lag
    let health = client.wal_health("order-mutations").await?;
    println!("WAL Health:");
    println!("  Latest offset: {}", health.latest_offset);
    println!("  Applied offset: {}", health.applied_offset);
    println!("  Lag: {} entries ({:.2}s)",
        health.lag_entries,
        health.lag_seconds);

    Ok(())
}

Go Client

package main

import (
	"context"
	"fmt"

	prism "github.com/prism/client-go"
)

func main() {
	client := prism.NewClient("localhost:8980")
	defer client.Close()

	ctx := context.Background()

	// Write to WAL
	result, err := client.WalWrite(ctx, &prism.WalWriteRequest{
		Namespace: "order-mutations",
		Key:       "order:12345",
		Data:      orderData,
	})
	if err != nil {
		panic(err)
	}

	fmt.Printf("WAL offset: %d, latency: %dms\n",
		result.WalOffset, result.LatencyMs)

	// Read with read-your-writes
	order, err := client.WalRead(ctx, &prism.WalReadRequest{
		Namespace: "order-mutations",
		Key:       "order:12345",
	})
	if err != nil {
		panic(err)
	}

	fmt.Printf("Source: %s, Data: %v\n", order.Source, order.Data)
}

Python Client

from prism_client import Client

async def main():
    async with Client("http://localhost:8980") as client:
        # Write to WAL (fast: 1-2ms)
        result = await client.wal_write(
            namespace="order-mutations",
            key="order:12345",
            data=order_data
        )

        print(f"WAL offset: {result.wal_offset}")
        print(f"Latency: {result.latency_ms}ms")

        # Read with read-your-writes guarantee
        order = await client.wal_read(
            namespace="order-mutations",
            key="order:12345"
        )

        print(f"Source: {order.source}")  # "wal" or "database"
        print(f"Data: {order.data}")

if __name__ == "__main__":
    import asyncio
    asyncio.run(main())

Implementation Details

WAL Message Schema

message WalEntry {
  // Identity
  string client_id = 1;
  string namespace = 2;
  string key = 3;

  // Operation
  enum Operation {
    INSERT = 0;
    UPDATE = 1;
    DELETE = 2;
    UPSERT = 3;
  }
  Operation operation = 4;

  // Payload
  bytes data = 5;

  // Integrity
  string checksum = 6;         // SHA256 of data
  string idempotency_key = 7;  // UUID for deduplication

  // Metadata
  int64 timestamp_ms = 8;
  int32 partition = 9;
  int64 offset = 10;

  // Ordering (for complex mutations)
  int32 sequence_number = 11;
  bool is_completion_marker = 12;
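
As a sketch of the producer side, the integrity fields can be filled in as follows; a dict stands in for the protobuf message, and the helper is illustrative rather than the Prism producer API:

import hashlib
import time
import uuid

def build_wal_entry(namespace: str, key: str, data: bytes, operation: str = "UPSERT") -> dict:
    """Populate the integrity and metadata fields of a WalEntry before appending."""
    return {
        "namespace": namespace,
        "key": key,
        "operation": operation,
        "data": data,
        "checksum": hashlib.sha256(data).hexdigest(),  # verified by the consumer
        "idempotency_key": str(uuid.uuid4()),          # deduplication on replay
        "timestamp_ms": int(time.time() * 1000),
    }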

Checkpoint Schema

CREATE TABLE wal_checkpoints (
    consumer_id         TEXT NOT NULL,
    topic               TEXT NOT NULL,
    partition           INTEGER NOT NULL,
    last_applied_offset BIGINT NOT NULL,
    timestamp           BIGINT NOT NULL,
    message_count       BIGINT NOT NULL,
    PRIMARY KEY (consumer_id, topic, partition)
);

CREATE INDEX idx_checkpoint_timestamp
    ON wal_checkpoints (timestamp DESC);

Idempotency Tracking

CREATE TABLE wal_idempotency (
    idempotency_key TEXT PRIMARY KEY,
    applied_at      BIGINT NOT NULL,
    wal_offset      BIGINT NOT NULL
);

-- TTL: Keep for 7 days (match WAL retention)
CREATE INDEX idx_idempotency_applied_at
    ON wal_idempotency (applied_at);

-- Cleanup old entries
DELETE FROM wal_idempotency
WHERE applied_at < (EXTRACT(EPOCH FROM NOW()) * 1000 - 604800000);
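
A minimal consumer-side sketch of the idempotency check against this table, using SQLite for brevity; in production the existence check, the mutation, and the idempotency insert should share one transaction, and the entry shape and apply_fn are illustrative:

import sqlite3
import time

def apply_once(db: sqlite3.Connection, entry: dict, apply_fn) -> bool:
    """Apply a WAL entry at most once, keyed on its idempotency_key."""
    seen = db.execute(
        "SELECT 1 FROM wal_idempotency WHERE idempotency_key = ?",
        (entry["idempotency_key"],),
    ).fetchone()
    if seen is not None:
        return False                      # already applied on a previous run; skip
    apply_fn(entry)                       # the actual database mutation
    db.execute(
        "INSERT INTO wal_idempotency (idempotency_key, applied_at, wal_offset) VALUES (?, ?, ?)",
        (entry["idempotency_key"], int(time.time() * 1000), entry["offset"]),
    )
    db.commit()
    return True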

Crash Recovery Flow

Consumer Crash Recovery

1. Consumer crashes after applying offset 505
   - Last checkpoint was saved at offset 500 (checkpoints are periodic)

2. Consumer restarts
   - Load checkpoint: last_applied_offset = 500

3. Resume from checkpoint
   - Poll Kafka from offset 500
   - Kafka returns messages [500..550]

4. Check idempotency
   - Query idempotency table for each message
   - Skip already-applied messages (500-505 already applied)
   - Apply new messages (506-550)

5. Update checkpoint
   - Save checkpoint: last_applied_offset = 550

6. Normal operation resumed
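
The recovery sequence above reduces to roughly the following consumer loop, assuming a checkpoint store, a Kafka-like poll function, and the idempotency check from the previous section; all names are illustrative:

def resume_consumer(consumer_id, topic, partition, checkpoints, poll, already_applied, apply):
    """Resume from the last saved checkpoint; idempotency covers the overlap window."""
    start = checkpoints.get((consumer_id, topic, partition), 0)      # e.g. offset 500
    for entry in poll(topic, partition, start_offset=start):         # e.g. offsets 500..550
        if already_applied(entry["idempotency_key"]):
            continue                                                 # 500-505: skip
        apply(entry)                                                 # 506-550: apply
        # Checkpoint after apply (in practice batched on the configured interval).
        checkpoints[(consumer_id, topic, partition)] = entry["offset"]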

Database Corruption Recovery

1. Detect corruption
   - Application reports incorrect data
   - Investigation finds bad mutation at offset 42,500

2. Stop consumers
   - Prevent further writes

3. Restore from backup
   - Restore database to state at offset 40,000

4. Selective replay
   - Replay WAL from offset 40,000
   - Skip offset 42,500 (faulty mutation)
   - Apply all other mutations up to latest offset

5. Resume normal operation
   - Database consistent
   - Consumers resume from latest offset
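
Selective replay is the same consumer loop with an explicit skip list, sketched here with illustrative names:

def selective_replay(poll, apply, start_offset: int, end_offset: int, skip_offsets: set):
    """Replay WAL entries in order, omitting known-bad offsets (e.g. {42_500})."""
    for entry in poll(start_offset=start_offset, end_offset=end_offset):
        if entry["offset"] in skip_offsets:
            continue                      # the faulty mutation is never re-applied
        apply(entry)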

Monitoring and Alerting

Key Metrics (Netflix-Informed)

# Write latency (target: p99 < 2ms)
prism_wal_write_latency_seconds{
  namespace="order-mutations",
  quantile="0.99"
} < 0.002

# WAL lag (target: < 1 second)
prism_wal_lag_seconds{
  namespace="order-mutations"
} < 1.0

# Unapplied entries (target: < 1000)
prism_wal_unapplied_entries{
  namespace="order-mutations"
} < 1000

# Consumer throughput
prism_wal_consumer_throughput{
  namespace="order-mutations",
  consumer="db-applier"
} > 1000  # msgs/sec

# Checkpoint freshness (target: < 30 seconds)
prism_wal_checkpoint_age_seconds{
  namespace="order-mutations",
  consumer="db-applier"
} < 30

# Dead letter queue depth (alert if > 0)
prism_wal_dlq_depth{
  namespace="order-mutations"
} == 0

Alert Rules

alerts:
  - name: HighWALLag
    condition: prism_wal_lag_seconds > 10
    severity: warning
    message: "WAL lag exceeds 10 seconds"
    runbook: https://docs.prism.io/runbooks/wal-lag

  - name: CriticalWALLag
    condition: prism_wal_lag_seconds > 60
    severity: critical
    message: "WAL lag exceeds 60 seconds - consumer may be stuck"
    action: page_oncall

  - name: DLQMessagesPresent
    condition: prism_wal_dlq_depth > 0
    severity: warning
    message: "Messages in dead letter queue require manual intervention"
    action: create_ticket

  - name: CheckpointStale
    condition: prism_wal_checkpoint_age_seconds > 300
    severity: critical
    message: "Checkpoint not updated in 5 minutes - consumer may be dead"
    action: page_oncall

Netflix Production Lessons Learned

1. Automatic Scaling is Essential

Netflix: WAL consumers must scale independently of producers. During traffic spikes, producers scale up to handle writes, while consumers maintain steady throughput.

Prism Implementation:

  • Kubernetes HPA based on prism_wal_lag_seconds metric
  • Scale consumers up when lag > 10 seconds
  • Scale down when lag < 1 second for 5 minutes
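
The scaling decision itself is simple; this sketch shows the policy the autoscaler would encode, using the thresholds above (the function is illustrative, not how a Kubernetes HPA is written):

def desired_consumer_count(current: int, lag_seconds: float, minutes_below_target: float,
                           min_workers: int = 1, max_workers: int = 16) -> int:
    """Scale up on sustained lag; scale down only after a quiet period."""
    if lag_seconds > 10:
        return min(current * 2, max_workers)          # lag > 10s: add consumers
    if lag_seconds < 1 and minutes_below_target >= 5:
        return max(current - 1, min_workers)          # calm for 5 minutes: shrink
    return current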

2. Adaptive Load Shedding

Netflix: When database is under load, back-pressure can propagate to WAL consumers. Adaptive load shedding prevents cascading failures.

Prism Implementation:

  • Monitor database connection pool utilization
  • If pool > 90% utilized, reduce consumer batch size
  • If pool > 95% utilized, add delay between batches
  • Circuit breaker after 3 consecutive batch failures
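
A sketch of the shedding policy, assuming the consumer can observe connection-pool utilization between batches; the thresholds follow the list above, and the circuit breaker is omitted for brevity:

def tune_batch(pool_utilization: float, current_batch: int, base_batch: int = 1000):
    """Return (batch_size, inter_batch_delay_s) based on database pressure."""
    if pool_utilization > 0.95:
        return max(current_batch // 2, 1), 0.5        # shrink batches and pause between them
    if pool_utilization > 0.90:
        return max(current_batch // 2, 1), 0.0        # shrink batches only
    return base_batch, 0.0                            # healthy: restore configured batch size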

3. Dedicated Dead Letter Queues

Netflix: Some mutations will never succeed (e.g., bad data format, constraint violation). Don't retry forever.

Prism Implementation:

  • After 10 retry attempts, move to DLQ
  • DLQ messages trigger alert for manual review
  • Operators can replay from DLQ after fixing root cause
  • Option to skip DLQ messages during replay
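
Combined with the retry policy from the namespace configuration, the consumer's per-entry handling could look roughly like this; apply and send_to_dlq are illustrative stand-ins:

import time

def apply_with_retry(entry, apply, send_to_dlq, max_attempts=10,
                     base_delay=0.1, max_delay=60.0, multiplier=2.0):
    """Retry with exponential backoff, then park the entry on the DLQ and alert."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            apply(entry)
            return True
        except Exception as err:
            if attempt == max_attempts:
                send_to_dlq(entry, reason=str(err))   # operators replay after fixing root cause
                return False
            time.sleep(delay)
            delay = min(delay * multiplier, max_delay)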

4. Namespace-Driven Configuration

Netflix: Different use cases need different WAL configurations. One size does not fit all.

Prism Implementation:

  • Per-namespace retention policies (1 day to 30 days)
  • Per-namespace replication factors (1x to 3x)
  • Per-namespace consumer parallelism (1 to 16 workers)
  • Per-namespace retry policies

5. Observability is Non-Negotiable

Netflix: WAL is infrastructure. Operators must see lag, throughput, errors in real-time.

Prism Implementation:

  • Rich metrics (write latency, lag, throughput, DLQ depth)
  • Distributed tracing (span from client write to DB apply)
  • Structured logging with trace IDs
  • Grafana dashboards per namespace

Comparison to Alternatives

WAL vs Direct Database Writes

Aspect                  WAL Pattern                     Direct Writes
----------------------  ------------------------------  ----------------------
Write Latency           1-2ms (Kafka)                   10-50ms (database)
Durability              Kafka 3-way replication         Database replication
Crash Recovery          Replay from checkpoint          Database recovery only
Traffic Spike Handling  Queue buffers spike             Database saturates
Multi-Datastore         Single write, multiple applies  Dual write complexity
Complexity              Medium (adds Kafka)             Low (just database)

When to use WAL:

  • ✅ High write throughput (1000s of writes/sec)
  • ✅ Need sub-5ms write latency
  • ✅ Writing to multiple datastores
  • ✅ Need point-in-time recovery
  • ✅ Traffic spikes are common

When to use direct writes:

  • ✅ Low write volume (<100 writes/sec)
  • ✅ Single datastore
  • ✅ Simplicity is paramount
  • ✅ Strong consistency required (not eventual)

WAL vs Event Sourcing

Aspect                WAL                     Event Sourcing
--------------------  ----------------------  ------------------
Purpose               Fast, durable writes    Full audit trail
Retention             7-30 days               Forever
State Reconstruction  From database           From event log
Complexity            Medium                  High
Read Model            Database                Materialized views

Use WAL when: You need fast writes and durability, not a full audit trail.
Use Event Sourcing when: You need time-travel queries and complete history.

WAL vs Outbox Pattern

Aspect        WAL                         Outbox
------------  --------------------------  -----------------------------
Write Path    Direct to Kafka             Database transaction + outbox
Consistency   Eventual                    Transactional
Latency       1-2ms                       10-20ms (database)
Use Case      High-throughput writes      Transactional messaging

Use WAL when: Write throughput is critical and eventual consistency is acceptable.
Use Outbox when: You need ACID guarantees for event publishing.

Implementation Roadmap

Phase 1: Core WAL Pattern (4 weeks)

  • ✅ Kafka producer integration
  • ✅ WAL write API (gRPC/HTTP)
  • ✅ Basic consumer with database writer
  • ✅ Checkpoint management (SQLite)
  • ✅ Idempotency tracking
  • ✅ Basic monitoring (lag, throughput)

Phase 2: Production Readiness (3 weeks)

  • ✅ Dead letter queue
  • ✅ Retry with exponential backoff
  • ✅ Read-your-writes consistency
  • ✅ Crash recovery testing
  • ✅ Multi-consumer configuration
  • ✅ Grafana dashboards

Phase 3: Netflix-Inspired Features (4 weeks)

  • ✅ Adaptive load shedding
  • ✅ Automatic consumer scaling
  • ✅ Cross-region replication (MirrorMaker)
  • ✅ Selective replay (skip faulty mutations)
  • ✅ Delayed queue support
  • ✅ Performance tuning (batching, compression)

References

Netflix Production Experience

Prism Documentation

Kafka & Durability

Revision History

  • 2025-11-07: Initial draft grounded in Netflix production experience and learnings
    • Documented Netflix's real-world use cases and metrics
    • Included production lessons learned (scaling, load shedding, DLQ)
    • Added comprehensive monitoring and alerting guidance
    • Provided implementation roadmap with Netflix-inspired features