RFC-051: Write-Ahead Log Pattern - Netflix Production Learnings
Overview
This RFC documents the Write-Ahead Log (WAL) pattern for Prism, grounded in Netflix's production experience running a distributed WAL platform at massive scale. Netflix's WAL handles billions of mutations daily, serving as a resilient buffer between applications and target datastores.
Key Insight from Netflix: Traditional database WALs are tied to a single database instance. Netflix abstracted this concept into a platform service that works across any datastore, providing durability, ordering, and reliable delivery as a reusable capability.
Production Context: Netflix's WAL at Scale
Real-World Metrics
Netflix's WAL operates at production scale with impressive characteristics:
- Write Latency: 1-2ms (Kafka append) vs 10-20ms (direct database writes)
- Throughput: Billions of mutations per day across all namespaces
- Durability: 3-way replication in Kafka survives 2 broker failures
- Recovery: Crash recovery via checkpoint replay, typically <30 seconds
- Availability: 99.99% uptime with automatic producer/consumer scaling
Problem Statement (Netflix's View)
Netflix identified several critical problems that WAL solves at scale:
- System Entropy Management: Multiple datastores drift out of sync due to partial failures
- Write Latency: Direct database writes block application threads (10-50ms)
- Data Corruption Recovery: Need point-in-time replay capability for incident recovery
- Traffic Spike Smoothing: Direct writes to databases cause cascading failures under load
- Generic Replication: Many datastores lack built-in replication capabilities
Architecture
High-Level Flow
┌─────────────────────────────────────────────────────────┐
│ Application │
│ (Write in 1-2ms, return) │
└─────────────────────────────────────────────────────────┘
│
│ write(key, value)
▼
┌─────────────────────────────────────────────────────────┐
│ Prism WAL Proxy │
│ │
│ 1. Append to WAL (Kafka/NATS) │
│ 2. Return success to client ◄─── Fast path! │
│ 3. Async apply to database │
│ │
└─────────────────────────────────────────────────────────┘
│ │
│ (fast, durable) │ (async, replayed)
▼ ▼
┌────────┐ ┌──────────┐
│ Kafka │ │ Postgres │
│ (1ms) │ │ (10ms) │
└────────┘ └──────────┘
3-way replication                Eventually consistent
Component Architecture
┌────────────────────────────────────────────────────────┐
│ Client Application │
└────────────────────────────────────────────────────────┘
│
│ gRPC/HTTP
▼
┌────────────────────────────────────────────────────────┐
│ Prism Proxy (Rust) │
│ │
│ ┌──────────────────────────────────────────┐ │
│ │ WAL Write Path │ │
│ │ 1. Validate session + auth │ │
│ │ 2. Generate idempotency key │ │
│ │ 3. Compute checksum (SHA256) │ │
│ │ 4. Append to Kafka topic │ │
│ │ 5. Return wal_offset to client │ │
│ └──────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────┐ │
│ │ WAL Read Path │ │
│ │ 1. Get consumer applied_offset │ │
│ │ 2. If data not applied, read from WAL │ │
│ │ 3. Else, read from database │ │
│ │ 4. Return data + source indicator │ │
│ └──────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────┘
│ │
│ Producer │ Consumer
▼ ▼
┌─────────────────┐ ┌──────────────────────────┐
│ Kafka Cluster │ │ WAL Consumer Pool │
│ │ │ │
│ topic: │ │ - Poll Kafka topic │
│ order-wal │ │ - Validate checksums │
│ │ │ - Apply to database │
│ retention: │◄────────│ - Update checkpoint │
│ 7 days │ poll │ - Handle retries │
│ │ │ - Dead letter queue │
│ replication: 3 │ │ │
│ partitions: 16 │ │ Parallelism: 4 workers │
└─────────────────┘ └──────────────────────────┘
│ │
│ │ apply
│ ▼
│ ┌──────────────────┐
│ │ PostgreSQL │
│ │ │
│ │ - Business data │
│ │ - Idempotency │
│ │ tracking │
│ └──────────────────┘
│
│ checkpoint
▼
┌─────────────────────────────────────────────────────┐
│ Checkpoint Store (SQLite/Postgres) │
│ │
│ - consumer_id │
│ - topic │
│ - partition │
│ - last_applied_offset │
│ - timestamp │
└─────────────────────────────────────────────────────┘
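The write path above is simple enough to sketch end to end. The following is a minimal, illustrative sketch only, assuming the `kafka-python` producer and a JSON-encoded entry; the real proxy is written in Rust, and the `wal_write` helper and JSON envelope are assumptions, not the Prism implementation.

```python
import hashlib
import json
import time
import uuid

from kafka import KafkaProducer  # assumption: kafka-python is available

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    acks="all",                  # wait for all in-sync replicas before acknowledging
    compression_type="snappy",
)

def wal_write(topic: str, namespace: str, key: str, data: dict) -> dict:
    """Append one mutation to the WAL topic and return its position (illustrative only)."""
    payload = json.dumps(data).encode()
    entry = {
        "namespace": namespace,
        "key": key,
        "data": payload.decode(),
        "checksum": hashlib.sha256(payload).hexdigest(),  # step 3: integrity checksum
        "idempotency_key": str(uuid.uuid4()),             # step 2: deduplication key
        "timestamp_ms": int(time.time() * 1000),
    }
    # Step 4: append to Kafka; keying by `key` keeps per-key ordering within a partition
    future = producer.send(topic, key=key.encode(), value=json.dumps(entry).encode())
    metadata = future.get(timeout=5)  # the Kafka append is the only blocking call
    # Step 5: return the WAL position to the caller
    return {"wal_offset": metadata.offset, "partition": metadata.partition}
```

The only work on the client's critical path is the Kafka append; applying the mutation to the target database happens later in the consumer pool.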
Netflix Production Use Cases
Netflix documented several production use cases where WAL proved critical:
1. System Entropy Management
Problem: Application writes to Cassandra (primary) and EVCache (secondary). Network partition causes EVCache writes to fail. System now has inconsistent state.
WAL Solution:
- Application writes to WAL once (1ms)
- WAL consumer applies to both Cassandra and EVCache
- Automatic retries with exponential backoff
- If EVCache fails, consumer retries until success
- Both datastores eventually consistent, guaranteed
Netflix Metric: Reduced data inconsistencies by 95% in multi-datastore applications.
2. Data Corruption Recovery
Problem: Database corruption due to bad application logic or storage failure. Need to restore from backup and replay changes.
WAL Solution:
- Restore database from backup (e.g., 6 hours old)
- Replay WAL mutations from 6 hours ago to present
- Option to omit specific faulty mutations during replay
- System back to consistent state
Netflix Metric: Reduced recovery time from hours to minutes for corruption incidents.
3. Traffic Spike Smoothing
Problem: Traffic spike causes database connection pool exhaustion. Direct writes start failing, cascading failures across microservices.
WAL Solution:
- Writes go to Kafka (scales horizontally)
- Consumer pool applies to database at sustainable rate
- Queue builds up during spike, drains after
- Application never sees database connection errors
Netflix Metric: Survived 10x traffic spikes without database saturation or application errors.
4. Generic Data Replication
Problem: Need to replicate data from us-east-1 to eu-west-1 for compliance. DynamoDB Global Tables too expensive.
WAL Solution:
- WAL consumers in both regions
- us-east-1 consumer applies to local DynamoDB
- Kafka MirrorMaker replicates to eu-west-1
- eu-west-1 consumer applies to local DynamoDB
- Cross-region replication with 200-500ms lag
Netflix Metric: Reduced cross-region replication costs by 70% vs managed services.
5. Asynchronous Processing with Delayed Queues
Problem: Bulk delete operation needs to delete 1 million records. Doing it synchronously blocks application thread for minutes.
WAL Solution:
- Write 1M delete operations to WAL with delay=60s, jitter=30s
- Application returns immediately
- Consumers process the deletes across a 60-90 second window (60s base delay plus up to 30s jitter; see the sketch below)
- Database load spread evenly, no saturation
Netflix Metric: Enabled bulk operations without impacting real-time latency.
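A rough sketch of the delayed-queue mechanics under stated assumptions: the `not_before_ms` field, the helper names, and the skip-until-eligible behavior are illustrative, not part of the WAL entry schema defined later in this RFC.

```python
import json
import random
import time

def enqueue_delete(producer, topic: str, key: str, delay_s: int = 60, jitter_s: int = 30) -> None:
    """Enqueue a delete that becomes eligible between delay_s and delay_s + jitter_s from now."""
    not_before_ms = int((time.time() + delay_s + random.uniform(0, jitter_s)) * 1000)
    entry = {"operation": "DELETE", "key": key, "not_before_ms": not_before_ms}
    producer.send(topic, key=key.encode(), value=json.dumps(entry).encode())

def maybe_apply(entry: dict, apply_fn) -> bool:
    """Consumer side: apply only entries whose delay has elapsed; leave the rest for a later poll."""
    if int(time.time() * 1000) < entry["not_before_ms"]:
        return False
    apply_fn(entry)
    return True
```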
Configuration
Prism WAL Namespace Configuration
namespaces:
  - name: order-mutations
    pattern: write-ahead-log
    backend: postgres            # Target database

    wal:
      # CRITICAL: Must use durable backend
      backend: kafka             # or nats-jetstream
      topic: order-wal

      # Retention (match to recovery SLA)
      retention: 604800          # 7 days

      # Durability settings (Netflix: 3-way replication)
      replication: 3
      min_in_sync_replicas: 2
      acks: all                  # Wait for all replicas before ack

      # Partitioning (scale writes)
      partitions: 16
      partition_key: "{key}"     # Partition by order key

      # Consumer configuration
      consumers:
        - name: db-applier
          type: database_writer
          backend: postgres
          parallelism: 4         # 4 workers per partition

          # Batching (reduce DB round-trips)
          batch_size: 1000
          batch_timeout: 100ms

          # Retry policy (Netflix: exponential backoff)
          retry:
            max_attempts: 10
            base_delay: 100ms
            max_delay: 60s
            multiplier: 2

          # Dead letter queue (after max retries)
          dlq:
            topic: order-wal-dlq
            alert_on_message: true

      # Checkpoint configuration
      checkpoint:
        backend: postgres        # or sqlite for local dev
        table: wal_checkpoints
        interval: 5s             # Save checkpoint every 5s

      # Read consistency mode
      consistency:
        read_mode: wal_first     # Check WAL before DB
        # Read-your-writes guarantee:
        # If client's last write offset > DB applied offset,
        # read from WAL instead of DB
        max_lag_tolerance: 1000ms  # Alert if lag exceeds 1s

      # Durability guarantees
      durability:
        fsync: true              # Kafka fsync before ack
        compression: snappy      # Reduce storage, maintain perf

      # Monitoring
      metrics:
        - wal_write_latency_p99
        - wal_lag_seconds
        - unapplied_entries_count
        - consumer_throughput
        - checkpoint_age_seconds
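To make `read_mode: wal_first` concrete, the sketch below shows the read-your-writes decision. All helpers (`last_write_offset`, `applied_offset`, `read_from_wal`, `read_from_db`) are hypothetical stand-ins for proxy internals.

```python
def wal_read(namespace: str, key: str, session) -> dict:
    """Read-your-writes: serve from the WAL if the client's last write has not reached the DB yet."""
    # All helpers below are hypothetical stand-ins for proxy internals.
    client_offset = session.last_write_offset(namespace, key)  # offset returned by the client's wal_write
    db_offset = applied_offset(namespace)                      # consumer checkpoint for this namespace
    if client_offset is not None and client_offset > db_offset:
        # The database has not yet applied this client's write; fall back to the WAL
        return {"source": "wal", "data": read_from_wal(namespace, key, client_offset)}
    return {"source": "database", "data": read_from_db(namespace, key)}
```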
Multi-Datastore Configuration (Netflix Pattern)
Netflix's WAL commonly writes to multiple datastores simultaneously:
namespaces:
  - name: user-profile-mutations
    pattern: write-ahead-log

    wal:
      backend: kafka
      topic: user-profile-wal
      retention: 604800
      replication: 3

      # Multiple consumers for multiple datastores
      consumers:
        - name: cassandra-applier
          type: database_writer
          backend: cassandra
          keyspace: user_profiles
          parallelism: 8

        - name: elasticsearch-applier
          type: search_indexer
          backend: elasticsearch
          index: user_profiles
          parallelism: 4

        - name: redis-applier
          type: cache_writer
          backend: redis
          ttl: 3600
          parallelism: 2

      # All consumers must apply before data considered "applied"
      consistency:
        read_mode: wal_first
        require_all_consumers: true  # Wait for all 3 consumers
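A minimal sketch of what this fan-out implies on the consumer side: with `require_all_consumers`, an entry counts as applied only after every configured applier succeeds, so a failure in any backend leaves the checkpoint untouched and the entry is retried. The applier stubs below are placeholders, not real backend clients.

```python
from typing import Callable, Dict

# One applier per configured consumer. Real appliers would wrap Cassandra,
# Elasticsearch, and Redis clients; these lambdas are placeholders.
appliers: Dict[str, Callable[[dict], None]] = {
    "cassandra-applier": lambda entry: None,      # write to the user_profiles keyspace
    "elasticsearch-applier": lambda entry: None,  # index into user_profiles
    "redis-applier": lambda entry: None,          # cache with ttl=3600
}

def apply_entry(entry: dict) -> bool:
    """With require_all_consumers, an entry is applied only when every applier succeeds."""
    for name, apply in appliers.items():
        try:
            apply(entry)
        except Exception:
            # Leave the checkpoint untouched so the entry is retried on the next poll
            return False
    return True
```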
Client API
Rust Client
use prism_client::{Client, WalWriteOptions};

#[tokio::main]
async fn main() -> Result<()> {
    let client = Client::connect("http://localhost:8980").await?;

    // Write to WAL (fast path: 1-2ms)
    let result = client
        .wal_write("order-mutations", "order:12345", order_data)
        .await?;

    println!("Write complete:");
    println!("  WAL offset: {}", result.wal_offset);
    println!("  Partition: {}", result.partition);
    println!("  Latency: {}ms", result.latency_ms);
    // Output: Latency: 1.2ms (Netflix: p99 < 2ms)

    // Read with read-your-writes guarantee
    let order = client
        .wal_read("order-mutations", "order:12345")
        .await?;

    println!("Read complete:");
    println!("  Source: {}", order.source); // "wal" or "database"
    println!("  Data: {:?}", order.data);

    // Check WAL lag
    let health = client.wal_health("order-mutations").await?;
    println!("WAL Health:");
    println!("  Latest offset: {}", health.latest_offset);
    println!("  Applied offset: {}", health.applied_offset);
    println!("  Lag: {} entries ({:.2}s)",
        health.lag_entries,
        health.lag_seconds);

    Ok(())
}
Go Client
package main

import (
    "context"
    "fmt"

    prism "github.com/prism/client-go"
)

func main() {
    client := prism.NewClient("localhost:8980")
    defer client.Close()

    ctx := context.Background()

    // Write to WAL
    result, err := client.WalWrite(ctx, &prism.WalWriteRequest{
        Namespace: "order-mutations",
        Key:       "order:12345",
        Data:      orderData,
    })
    if err != nil {
        panic(err)
    }
    fmt.Printf("WAL offset: %d, latency: %dms\n",
        result.WalOffset, result.LatencyMs)

    // Read with read-your-writes
    order, err := client.WalRead(ctx, &prism.WalReadRequest{
        Namespace: "order-mutations",
        Key:       "order:12345",
    })
    if err != nil {
        panic(err)
    }
    fmt.Printf("Source: %s, Data: %v\n", order.Source, order.Data)
}
Python Client
from prism_client import Client

async def main():
    async with Client("http://localhost:8980") as client:
        # Write to WAL (fast: 1-2ms)
        result = await client.wal_write(
            namespace="order-mutations",
            key="order:12345",
            data=order_data
        )
        print(f"WAL offset: {result.wal_offset}")
        print(f"Latency: {result.latency_ms}ms")

        # Read with read-your-writes guarantee
        order = await client.wal_read(
            namespace="order-mutations",
            key="order:12345"
        )
        print(f"Source: {order.source}")  # "wal" or "database"
        print(f"Data: {order.data}")

if __name__ == "__main__":
    import asyncio
    asyncio.run(main())
Implementation Details
WAL Message Schema
message WalEntry {
  // Identity
  string client_id = 1;
  string namespace = 2;
  string key = 3;

  // Operation
  enum Operation {
    INSERT = 0;
    UPDATE = 1;
    DELETE = 2;
    UPSERT = 3;
  }
  Operation operation = 4;

  // Payload
  bytes data = 5;

  // Integrity
  string checksum = 6;         // SHA256 of data
  string idempotency_key = 7;  // UUID for deduplication

  // Metadata
  int64 timestamp_ms = 8;
  int32 partition = 9;
  int64 offset = 10;

  // Ordering (for complex mutations)
  int32 sequence_number = 11;
  bool is_completion_marker = 12;
}
Checkpoint Schema
CREATE TABLE wal_checkpoints (
    consumer_id TEXT NOT NULL,
    topic TEXT NOT NULL,
    partition INTEGER NOT NULL,
    last_applied_offset BIGINT NOT NULL,
    timestamp BIGINT NOT NULL,
    message_count BIGINT NOT NULL,
    PRIMARY KEY (consumer_id, topic, partition)
);

CREATE INDEX idx_checkpoint_timestamp
    ON wal_checkpoints (timestamp DESC);
Idempotency Tracking
CREATE TABLE wal_idempotency (
    idempotency_key TEXT PRIMARY KEY,
    applied_at BIGINT NOT NULL,
    wal_offset BIGINT NOT NULL
);

-- TTL: Keep for 7 days (match WAL retention)
CREATE INDEX idx_idempotency_applied_at
    ON wal_idempotency (applied_at);

-- Cleanup old entries
DELETE FROM wal_idempotency
WHERE applied_at < (EXTRACT(EPOCH FROM NOW()) * 1000 - 604800000);
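Putting the checksum and idempotency pieces together, a consumer apply might look roughly like the sketch below (psycopg2-style parameter binding; the table and column names come from the schemas above, while `apply_mutation` is a hypothetical helper):

```python
import hashlib

def apply_wal_entry(conn, entry: dict) -> None:
    """Verify integrity, deduplicate via wal_idempotency, then apply the mutation."""
    if hashlib.sha256(entry["data"].encode()).hexdigest() != entry["checksum"]:
        raise ValueError("checksum mismatch: entry corrupted in transit")

    with conn.cursor() as cur:
        # Record the idempotency key first; a conflict means this entry was already applied
        cur.execute(
            "INSERT INTO wal_idempotency (idempotency_key, applied_at, wal_offset) "
            "VALUES (%s, EXTRACT(EPOCH FROM NOW()) * 1000, %s) "
            "ON CONFLICT (idempotency_key) DO NOTHING",
            (entry["idempotency_key"], entry["offset"]),
        )
        if cur.rowcount == 0:
            return  # duplicate delivery (e.g. replay after a crash); skip
        apply_mutation(cur, entry)  # hypothetical: the INSERT/UPDATE/DELETE on the business table
    conn.commit()
```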
Crash Recovery Flow
Consumer Crash Recovery
1. Consumer crashes at offset 500
   - Checkpoint saved at offset 500
2. Consumer restarts
   - Load checkpoint: last_applied_offset = 500
3. Resume from checkpoint
   - Poll Kafka from offset 500
   - Kafka returns messages [500..550]
4. Check idempotency
   - Query idempotency table for each message
   - Skip already-applied messages (500-505 already applied)
   - Apply new messages (506-550)
5. Update checkpoint
   - Save checkpoint: last_applied_offset = 550
6. Normal operation resumed (see the checkpoint-resume sketch below)
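Steps 2 and 3 above amount to loading the checkpoint row and seeking the Kafka consumer past it. A sketch assuming the `kafka-python` consumer and the `wal_checkpoints` table defined earlier; connection handling is elided:

```python
from kafka import KafkaConsumer, TopicPartition  # assumption: kafka-python

def resume_from_checkpoint(conn, consumer_id: str, topic: str, partition: int) -> KafkaConsumer:
    """On restart, position the Kafka consumer just after the last applied offset."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT last_applied_offset FROM wal_checkpoints "
            "WHERE consumer_id = %s AND topic = %s AND partition = %s",
            (consumer_id, topic, partition),
        )
        row = cur.fetchone()
    last_applied = row[0] if row else -1  # no checkpoint yet: start from the beginning

    consumer = KafkaConsumer(bootstrap_servers="localhost:9092", enable_auto_commit=False)
    tp = TopicPartition(topic, partition)
    consumer.assign([tp])
    consumer.seek(tp, last_applied + 1)   # idempotency tracking absorbs any re-delivered entries
    return consumer
```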
Database Corruption Recovery
1. Detect corruption
   - Application reports incorrect data
   - Investigation finds bad mutation at offset 42,500
2. Stop consumers
   - Prevent further writes
3. Restore from backup
   - Restore database to state at offset 40,000
4. Selective replay
   - Replay WAL from offset 40,000
   - Skip offset 42,500 (faulty mutation)
   - Apply all other mutations up to latest offset
5. Resume normal operation
   - Database consistent
   - Consumers resume from latest offset (see the replay sketch below)
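Selective replay is essentially a filtered scan of the WAL. A sketch that assumes the faulty offsets are already known and reuses the `apply_wal_entry` sketch from the Idempotency Tracking section:

```python
import json

def selective_replay(consumer, conn, from_offset: int, to_offset: int, skip_offsets: set) -> None:
    """Replay [from_offset, to_offset], omitting known-faulty mutations."""
    for message in consumer:                    # kafka-python yields ConsumerRecord objects
        if message.offset < from_offset:
            continue
        if message.offset > to_offset:
            break
        if message.offset in skip_offsets:      # e.g. {42500}, the bad mutation
            continue
        entry = json.loads(message.value)
        apply_wal_entry(conn, entry)            # idempotency table skips already-applied entries
```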
Monitoring and Alerting
Key Metrics (Netflix-Informed)
# Write latency (target: p99 < 2ms)
prism_wal_write_latency_seconds{
  namespace="order-mutations",
  quantile="0.99"
} < 0.002

# WAL lag (target: < 1 second)
prism_wal_lag_seconds{
  namespace="order-mutations"
} < 1.0

# Unapplied entries (target: < 1000)
prism_wal_unapplied_entries{
  namespace="order-mutations"
} < 1000

# Consumer throughput
prism_wal_consumer_throughput{
  namespace="order-mutations",
  consumer="db-applier"
} > 1000  # msgs/sec

# Checkpoint freshness (target: < 30 seconds)
prism_wal_checkpoint_age_seconds{
  namespace="order-mutations",
  consumer="db-applier"
} < 30

# Dead letter queue depth (alert if > 0)
prism_wal_dlq_depth{
  namespace="order-mutations"
} == 0
Alert Rules
alerts:
  - name: HighWALLag
    condition: prism_wal_lag_seconds > 10
    severity: warning
    message: "WAL lag exceeds 10 seconds"
    runbook: https://docs.prism.io/runbooks/wal-lag

  - name: CriticalWALLag
    condition: prism_wal_lag_seconds > 60
    severity: critical
    message: "WAL lag exceeds 60 seconds - consumer may be stuck"
    action: page_oncall

  - name: DLQMessagesPresent
    condition: prism_wal_dlq_depth > 0
    severity: warning
    message: "Messages in dead letter queue require manual intervention"
    action: create_ticket

  - name: CheckpointStale
    condition: prism_wal_checkpoint_age_seconds > 300
    severity: critical
    message: "Checkpoint not updated in 5 minutes - consumer may be dead"
    action: page_oncall
Netflix Production Lessons Learned
1. Automatic Scaling is Essential
Netflix: WAL consumers must scale independently of producers. During traffic spikes, producers scale up to handle writes, while consumers maintain steady throughput.
Prism Implementation:
- Kubernetes HPA based on the prism_wal_lag_seconds metric
- Scale consumers up when lag > 10 seconds
- Scale down when lag < 1 second sustained for 5 minutes
2. Adaptive Load Shedding
Netflix: When database is under load, back-pressure can propagate to WAL consumers. Adaptive load shedding prevents cascading failures.
Prism Implementation:
- Monitor database connection pool utilization
- If pool > 90% utilized, reduce consumer batch size
- If pool > 95% utilized, add delay between batches
- Circuit breaker after 3 consecutive batch failures (see the sketch below)
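These thresholds translate into a small control loop. The sketch below is illustrative; the `pool_utilization` input and the batch-size bounds are assumptions, not tuned production values.

```python
import time

def adjust_batch(batch_size: int, pool_utilization: float, consecutive_failures: int) -> int:
    """Shrink batches and slow down as the database connection pool saturates."""
    if consecutive_failures >= 3:
        raise RuntimeError("circuit open: pause consumption and alert")  # circuit breaker
    if pool_utilization > 0.95:
        time.sleep(0.5)                      # add delay between batches
        return max(batch_size // 2, 10)
    if pool_utilization > 0.90:
        return max(batch_size // 2, 10)      # reduce batch size, no extra delay
    return min(batch_size * 2, 1000)         # healthy: grow back toward the configured batch_size
```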
3. Dedicated Dead Letter Queues
Netflix: Some mutations will never succeed (e.g., bad data format, constraint violation). Don't retry forever.
Prism Implementation:
- After 10 retry attempts, move to DLQ
- DLQ messages trigger alert for manual review
- Operators can replay from DLQ after fixing root cause
- Option to skip DLQ messages during replay (see the retry/DLQ sketch below)
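A sketch of the retry-then-DLQ flow using the retry parameters from the namespace configuration; `apply_fn` and `dlq_producer` are assumed to be supplied by the consumer.

```python
import json
import time

def apply_with_retries(entry: dict, apply_fn, dlq_producer, dlq_topic: str,
                       max_attempts: int = 10, base_delay: float = 0.1,
                       max_delay: float = 60.0, multiplier: float = 2.0) -> None:
    """Retry with exponential backoff, then park the entry in the DLQ for manual review."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            apply_fn(entry)
            return
        except Exception:
            if attempt == max_attempts:
                break
            time.sleep(delay)
            delay = min(delay * multiplier, max_delay)
    # Retries exhausted: the DLQ depth metric goes above 0 and an alert fires
    dlq_producer.send(dlq_topic, value=json.dumps(entry).encode())
```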
4. Namespace-Driven Configuration
Netflix: Different use cases need different WAL configurations. One size does not fit all.
Prism Implementation:
- Per-namespace retention policies (1 day to 30 days)
- Per-namespace replication factors (1x to 3x)
- Per-namespace consumer parallelism (1 to 16 workers)
- Per-namespace retry policies
5. Observability is Non-Negotiable
Netflix: WAL is infrastructure. Operators must see lag, throughput, errors in real-time.
Prism Implementation:
- Rich metrics (write latency, lag, throughput, DLQ depth)
- Distributed tracing (span from client write to DB apply)
- Structured logging with trace IDs
- Grafana dashboards per namespace
Comparison to Alternatives
WAL vs Direct Database Writes
| Aspect | WAL Pattern | Direct Writes |
|---|---|---|
| Write Latency | 1-2ms (Kafka) | 10-50ms (database) |
| Durability | Kafka 3-way replication | Database replication |
| Crash Recovery | Replay from checkpoint | Database recovery only |
| Traffic Spike Handling | Queue buffers spike | Database saturates |
| Multi-Datastore | Single write, multiple applies | Dual write complexity |
| Complexity | Medium (adds Kafka) | Low (just database) |
When to use WAL:
- ✅ High write throughput (1000s of writes/sec)
- ✅ Need sub-5ms write latency
- ✅ Writing to multiple datastores
- ✅ Need point-in-time recovery
- ✅ Traffic spikes are common
When to use direct writes:
- ✅ Low write volume (<100 writes/sec)
- ✅ Single datastore
- ✅ Simplicity is paramount
- ✅ Strong consistency required (not eventual)
WAL vs Event Sourcing
| Aspect | WAL | Event Sourcing |
|---|---|---|
| Purpose | Fast, durable writes | Full audit trail |
| Retention | 7-30 days | Forever |
| State Reconstruction | From database | From event log |
| Complexity | Medium | High |
| Read Model | Database | Materialized views |
Use WAL when: You need fast writes and durability, not a full audit trail. Use Event Sourcing when: You need time-travel queries and a complete history.
WAL vs Outbox Pattern
| Aspect | WAL | Outbox |
|---|---|---|
| Write Path | Direct to Kafka | Database transaction + outbox |
| Consistency | Eventual | Transactional |
| Latency | 1-2ms | 10-20ms (database) |
| Use Case | High-throughput writes | Transactional messaging |
Use WAL when: Write throughput is critical and eventual consistency is acceptable. Use Outbox when: You need ACID guarantees for event publishing.
Implementation Roadmap
Phase 1: Core WAL Pattern (4 weeks)
- ✅ Kafka producer integration
- ✅ WAL write API (gRPC/HTTP)
- ✅ Basic consumer with database writer
- ✅ Checkpoint management (SQLite)
- ✅ Idempotency tracking
- ✅ Basic monitoring (lag, throughput)
Phase 2: Production Readiness (3 weeks)
- ✅ Dead letter queue
- ✅ Retry with exponential backoff
- ✅ Read-your-writes consistency
- ✅ Crash recovery testing
- ✅ Multi-consumer configuration
- ✅ Grafana dashboards
Phase 3: Netflix-Inspired Features (4 weeks)
- ✅ Adaptive load shedding
- ✅ Automatic consumer scaling
- ✅ Cross-region replication (MirrorMaker)
- ✅ Selective replay (skip faulty mutations)
- ✅ Delayed queue support
- ✅ Performance tuning (batching, compression)
References
Netflix Production Experience
- Netflix Tech Blog: Building a Resilient Data Platform with Write-Ahead Log: Original source of production learnings
- Netflix Data Gateway Reference: Overview of Netflix's data gateway architecture
- Netflix WAL Documentation: Summary of WAL use cases and functionality
Prism Documentation
- MEMO-001: WAL Full Transaction Flow: Detailed sequence diagrams for WAL operations
- RFC-009: Distributed Reliability Patterns: WAL pattern alongside other reliability patterns
- ADR-002: Client-Originated Configuration: How WAL namespaces are configured
Kafka & Durability
- Kafka Documentation: Replication: Kafka's durability guarantees
- Kafka Documentation: Consumer Groups: Consumer offset management
- NATS JetStream Documentation: Alternative durable backend
Related Patterns
- Outbox Pattern (Microservices.io): Transactional alternative to WAL
- Event Sourcing (Martin Fowler): Full audit trail pattern
- Change Data Capture (Debezium): Database WAL-based replication
Revision History
- 2025-11-07: Initial draft grounded in Netflix production experience and learnings
- Documented Netflix's real-world use cases and metrics
- Included production lessons learned (scaling, load shedding, DLQ)
- Added comprehensive monitoring and alerting guidance
- Provided implementation roadmap with Netflix-inspired features