PRD-001: Prism Data Access Gateway
Executive Summary
Prism is a high-performance data access gateway that provides unified APIs for heterogeneous backend datastores, enabling application developers to focus on business logic while the platform handles data access complexity, migrations, and operational concerns.
Inspired by Netflix's Data Gateway, Prism adopts proven patterns from Netflix's 8M+ QPS, 3,500+ use-case platform while improving performance (10-100x via Rust), developer experience (client-originated configuration), and operational simplicity (local-first testing, flexible deployment).
Target Launch: Q2 2026 (Phase 0 POCs complete by Q1 2026)
Success Metric: 80% of internal microservices use Prism for data access within 12 months of GA
Product Vision
The Problem: Data Access Complexity at Scale
Modern microservices architectures face growing data access challenges:
- API Fragmentation: Each datastore (Redis, Postgres, Kafka, DynamoDB) has unique APIs, client libraries, and operational requirements
- Migration Complexity: Changing backends requires rewriting application code, extensive testing, and risky deployments
- Distributed Systems Knowledge Gap: Most application developers shouldn't need expertise in consistency models, partitioning, replication, and distributed transactions
- Operational Burden: Each backend requires separate monitoring, capacity planning, security configuration, and disaster recovery
- Pattern Reimplementation: Common patterns (outbox, claim check, sagas) are reimplemented inconsistently across teams
The Solution: Unified Data Access Layer
Prism provides abstraction without compromise:
- Unified APIs: Single set of gRPC/HTTP APIs for KeyValue, PubSub, Queue, TimeSeries, Graph, and Document access patterns
- Backend Agnostic: Application code unchanged when switching from Redis to DynamoDB, or Kafka to NATS
- Semantic Guarantees: Patterns like Multicast Registry coordinate multiple backends atomically
- High Performance: Rust-based proxy achieves sub-millisecond p99 latency even at 100K+ RPS
- Zero-Downtime Migrations: Shadow traffic and dual-write patterns enable gradual backend changes
- Operational Simplicity: Centralized monitoring, security, and capacity management
Strategic Goals
- Accelerate Development: Reduce time-to-production for new services by 50% (eliminate backend integration work)
- Enable Migrations: Support 3+ major backend migrations per year with zero application code changes
- Reduce Operational Cost: Consolidate backend expertise, reduce redundant tooling, optimize resource utilization
- Improve Reliability: Provide battle-tested patterns, circuit breaking, load shedding, and failover built-in
- Foster Innovation: Allow teams to experiment with new backends without rewriting applications
Market Context
Netflix Data Gateway Learnings
Netflix's Data Gateway serves as our primary inspiration:
Scale Achievements:
- 8M+ queries per second (key-value abstraction)
- 10M+ writes per second (time-series data)
- 3,500+ use cases across the organization
- Petabyte-scale storage with low-latency retrieval
Key Lessons Adopted (Netflix Index):
Netflix Lesson | Prism Implementation |
---|---|
Abstraction Simplifies Scale | Layer 1: Primitives (KeyValue, PubSub, Queue, TimeSeries, Graph, Document) |
Prioritize Reliability | Circuit breaking, load shedding, failover built-in (ADR-029) |
Data Management Critical | TTL, lifecycle policies, tiering strategies first-class (RFC-014) |
Sharding for Isolation | Namespace-based isolation, per-tenant deployments (ADR-034) |
Zero-Downtime Migrations | Shadow traffic, dual-write patterns, phased cutover (ADR-031) |
Prism's Improvements Over Netflix
Aspect | Netflix Approach | Prism Enhancement | Benefit |
---|---|---|---|
Proxy Layer | JVM-based gateway | Rust-based (Tokio + Tonic) | 10-100x performance, lower resource usage |
Configuration | Runtime deployment configs | Client-originated (apps declare needs) | Self-service, reduced ops toil |
Testing | Production-validated | Local-first (sqlite, testcontainers) | Fast feedback, deterministic tests |
Deployment | Kubernetes-native | Flexible (bare metal, VMs, containers) | Simpler operations, lower cost |
Documentation | Internal wiki | Documentation-first (ADRs, RFCs, micro-CMS) | Faster onboarding, knowledge preservation |
User Personas
Primary: Application Developer (Backend Engineer)
Profile: Mid-level engineer building microservices, 2-5 years experience, proficient in one language (Go, Python, Rust, Java)
Goals:
- Build features quickly without learning distributed systems internals
- Use familiar patterns (REST APIs, pub/sub messaging) without backend-specific knowledge
- Deploy code confidently without breaking production
- Understand system behavior when things go wrong
Pain Points:
- Overwhelmed by backend options (Redis, Postgres, Kafka, DynamoDB, Cassandra...)
- Spending weeks integrating with each new datastore
- Fear of making wrong architectural decisions early
- Debugging distributed systems issues without proper training
Prism Value:
- ✅ Single API for all data access (learn once, use everywhere)
- ✅ Start with MemStore (in-memory), migrate to Redis/Postgres later without code changes
- ✅ Pattern library provides proven solutions (Multicast Registry, Saga, Event Sourcing)
- ✅ Rich error messages and observability built-in
Success Metric: Time from "new service" to "production" < 1 week
Secondary: Platform Engineer (Infrastructure Team)
Profile: Senior engineer responsible for platform services, 5-10 years experience, deep distributed systems knowledge
Goals:
- Provide self-service capabilities to application teams
- Maintain platform stability (high availability, low latency)
- Manage cost and capacity efficiently
- Enable safe migrations and experiments
Pain Points:
- Supporting N different datastores with N different operational models
- Manual work for every new namespace or backend instance
- Risk of cascading failures from misbehaving applications
- Difficult to enforce best practices (circuit breaking, retries, timeouts)
Prism Value:
- ✅ Centralized observability and operational controls
- ✅ Policy enforcement (rate limiting, access control, data governance)
- ✅ Self-service namespace creation via declarative config
- ✅ Backend substitutability (migrate Redis → DynamoDB transparently)
Success Metric: Operational incidents reduced by 50%, MTTR < 15 minutes
Tertiary: Data Engineer / Analyst
Profile: Specialist working with analytics, ML pipelines, or data warehousing
Goals:
- Access production data for analytics safely
- Build ETL pipelines without impacting production services
- Integrate with existing analytics tools (Spark, Airflow, Snowflake)
Pain Points:
- Direct database access risks impacting production
- Inconsistent data formats across microservices
- Difficult to maintain data lineage and quality
Prism Value:
- ✅ Read-only replicas and Change Data Capture (CDC) support
- ✅ TimeSeries abstraction for metrics and event logs
- ✅ Graph abstraction for relationship queries
- ✅ Audit trails and data provenance built-in
Success Metric: Analytics queries don't impact production latency
Core Features
Feature 1: Layered API Architecture
Layer 1: Primitives (Always Available)
Six foundational abstractions that compose to solve 80% of use cases:
Primitive | Purpose | Backend Examples | RFC Reference |
---|---|---|---|
KeyValue | Simple storage | Redis, DynamoDB, etcd, Postgres, MemStore | RFC-014 |
PubSub | Fire-and-forget messaging | NATS, Redis, Kafka (as topic) | RFC-014 |
Queue | Work distribution | SQS, Postgres, RabbitMQ | RFC-014 |
Stream | Ordered event log | Kafka, NATS JetStream, Redis Streams | RFC-014 |
TimeSeries | Temporal data | ClickHouse, TimescaleDB, Prometheus | RFC-014 |
Graph | Relationships | Neptune, Neo4j, Postgres (recursive CTEs) | RFC-014 |
Example Usage (Primitives):
from prism import Client
client = Client(endpoint="localhost:8080")
# KeyValue: Simple storage
await client.keyvalue.set("user:123", user_data, ttl=3600)
user = await client.keyvalue.get("user:123")
# PubSub: Broadcast events
await client.pubsub.publish("user-events", event_data)
async for event in client.pubsub.subscribe("user-events"):
    process(event)
# Queue: Background jobs
await client.queue.enqueue("email-jobs", email_task)
task = await client.queue.dequeue("email-jobs", visibility_timeout=30)
Layer 2: Patterns (Use-Case-Specific, Opt-In)
Purpose-built patterns that coordinate multiple backends for common use cases:
Pattern | Solves | Composes | RFC Reference |
---|---|---|---|
Multicast Registry | Device management, presence, service discovery | KeyValue + PubSub + Queue | RFC-017 |
Saga | Distributed transactions | KeyValue + Queue + Compensation | Planned Q2 2026 |
Event Sourcing | Audit trails, event replay | Stream + KeyValue + Snapshots | Planned Q2 2026 |
Cache Aside | Read-through caching | KeyValue (cache) + KeyValue (db) | Planned Q3 2026 |
Outbox | Transactional messaging | KeyValue (tx) + Queue + WAL | Planned Q3 2026 |
Example Usage (Patterns):
# Multicast Registry: IoT device management
registry = client.multicast_registry("iot-devices")
# Register device with metadata
await registry.register(
    identity="device-sensor-001",
    metadata={"type": "temperature", "location": "building-a", "floor": 3}
)
# Enumerate matching devices
devices = await registry.enumerate(filter={"location": "building-a"})
# Multicast command to filtered subset
result = await registry.multicast(
    filter={"type": "temperature", "floor": 3},
    message={"command": "read", "sample_rate": 5}
)
print(f"Delivered to {result.success_count}/{result.total_count} devices")
Why Layered? (MEMO-005)
- Layer 1 for power users who need full control and novel compositions
- Layer 2 for most developers who want ergonomic, self-documenting APIs
- Choice based on team expertise and use case requirements
User Persona Mapping:
- Application Developers: Primarily Layer 2 (80% of use cases)
- Platform Engineers: Both layers (Layer 1 for infrastructure, Layer 2 for application teams)
- Advanced Users: Layer 1 for custom patterns
Feature 2: Backend Plugin Architecture
Goal: Support 10+ backends without bloating the core proxy
Architecture (RFC-008):
┌──────────────────────────────────────────────────────┐
│ Prism Proxy (Rust Core) │
│ ┌────────────────────────────────────────────────┐ │
│ │ Layer 1 API: KeyValue, PubSub, Queue, etc. │ │
│ └────────────────────────────────────────────────┘ │
│ │ │
│ ↓ gRPC │
│ ┌────────────────────────────────────────────────┐ │
│ │ Namespace Router (config-driven) │ │
│ └────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────┘
│
┌────────────┼────────────┐
↓ ↓ ↓
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Redis │ │ Postgres│ │ Kafka │
│ Plugin │ │ Plugin │ │ Plugin │
│ (Go) │ │ (Go) │ │ (Go) │
└─────────┘ └─────────┘ └─────────┘
Backend Interface Decomposition (MEMO-006):
Instead of exposing a monolithic "Redis backend", each backend advertises a set of thin interfaces:
# Redis implements 24 interfaces
backend: redis
implements:
- keyvalue_basic # Set, Get, Delete, Exists
- keyvalue_scan # Scan, ScanKeys, Count
- keyvalue_ttl # Expire, GetTTL, Persist
- keyvalue_transactional # MULTI/EXEC
- keyvalue_batch # MGET, MSET
- pubsub_basic # Publish, Subscribe
- pubsub_wildcards # Pattern matching
- stream_basic # XADD, XREAD
- stream_consumer_groups # XGROUP, XREADGROUP
- stream_replay # XREAD from offset
# ...and 14 more (Lists, Sets, SortedSets)
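To make the decomposition concrete, a sketch of how these thin interfaces might look on the Go plugin side is shown below. It is illustrative only: the interface and method names (KeyValueBasic, KeyValueTTL, Capabilities) mirror the YAML identifiers above but are assumptions, not the actual MEMO-006 definitions.

```go
package plugin

import (
    "context"
    "time"
)

// KeyValueBasic is the smallest capability a backend can advertise (keyvalue_basic).
type KeyValueBasic interface {
    Set(ctx context.Context, key string, value []byte) error
    Get(ctx context.Context, key string) ([]byte, error)
    Delete(ctx context.Context, key string) error
    Exists(ctx context.Context, key string) (bool, error)
}

// KeyValueTTL is an optional capability layered on top of KeyValueBasic (keyvalue_ttl).
type KeyValueTTL interface {
    Expire(ctx context.Context, key string, ttl time.Duration) error
    GetTTL(ctx context.Context, key string) (time.Duration, error)
    Persist(ctx context.Context, key string) error
}

// Capabilities lets the proxy discover at registration time which interfaces a
// plugin implements, matching the identifiers listed in the YAML above.
type Capabilities interface {
    Implements() []string // e.g. ["keyvalue_basic", "keyvalue_ttl", "pubsub_basic"]
}
```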
Pattern Slot Matching:
Patterns declare required interfaces for each slot, proxy validates at config time:
pattern: multicast-registry
slots:
  registry:
    required: [keyvalue_basic, keyvalue_scan]
    optional: [keyvalue_ttl]
    recommended: [redis, postgres, dynamodb, etcd]
  messaging:
    required: [pubsub_basic]
    optional: [pubsub_persistent]
    recommended: [nats, kafka, redis]
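A minimal sketch of that config-time check, assuming hypothetical PatternSlot and ValidateSlot names; the real proxy would run this validation when a namespace configuration is loaded:

```go
package config

import "fmt"

// PatternSlot mirrors the YAML above: each slot names the interfaces it needs.
type PatternSlot struct {
    Name     string
    Required []string
    Optional []string
}

// ValidateSlot fails fast at config time if the chosen backend does not advertise
// every required interface; optional interfaces merely unlock extra features.
func ValidateSlot(slot PatternSlot, backendName string, implemented []string) error {
    have := make(map[string]bool, len(implemented))
    for _, iface := range implemented {
        have[iface] = true
    }
    for _, req := range slot.Required {
        if !have[req] {
            return fmt.Errorf("slot %q: backend %q is missing required interface %q",
                slot.Name, backendName, req)
        }
    }
    return nil
}
```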
Backend Priority (MEMO-004):
Phase 1 (Internal Priorities):
0. MemStore (In-memory Go map) - Score: 100/100 - Zero dependencies, instant startup
1. Kafka - Score: 78/100 - Internal event streaming
2. NATS - Score: 90/100 - Internal pub/sub messaging
3. PostgreSQL - Score: 93/100 - Internal relational data
4. Neptune - Score: 50/100 - Internal graph data
Phase 2 (External/Supporting):
5. Redis - Score: 95/100 - General caching
6. SQLite - Score: 92/100 - Embedded testing
7. S3/MinIO - Score: 85/100 - Large payload handling
8. ClickHouse - Score: 70/100 - Analytics
Feature 3: Client-Originated Configuration
Problem: Traditional approaches require ops teams to provision infrastructure before developers can code.
Prism Approach: Application declares requirements, platform provisions automatically.
Configuration Format:
# Application: prism.yaml (committed to app repo)
namespaces:
  - name: user-sessions
    pattern: keyvalue
    needs:
      latency: p99 < 10ms
      throughput: 50K rps
      ttl: required
      persistence: optional  # Can survive restarts
    backend:
      type: redis  # Explicit choice
      # OR
      auto: true   # Platform selects best match

  - name: notification-queue
    pattern: queue
    needs:
      visibility_timeout: 30s
      dead_letter: true
      throughput: 10K enqueues/sec
    backend:
      type: postgres  # Using Postgres as queue (SKIP LOCKED pattern)
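For illustration, a dequeue against the Postgres-backed queue could use SELECT ... FOR UPDATE SKIP LOCKED roughly as sketched below. The table and column names (queue_jobs, visible_at) and the 30-second visibility window are assumptions chosen to match the config above, not Prism's actual schema.

```go
package pgqueue

import (
    "context"
    "database/sql"
)

// DequeueOne claims one ready job with FOR UPDATE SKIP LOCKED, so concurrent
// consumers never block on or double-claim the same row.
func DequeueOne(ctx context.Context, db *sql.DB) (id int64, payload []byte, err error) {
    tx, err := db.BeginTx(ctx, nil)
    if err != nil {
        return 0, nil, err
    }
    defer tx.Rollback() // returns ErrTxDone after a successful Commit; safe to ignore

    row := tx.QueryRowContext(ctx, `
        SELECT id, payload
        FROM queue_jobs
        WHERE visible_at <= now()
        ORDER BY id
        LIMIT 1
        FOR UPDATE SKIP LOCKED`)
    if err := row.Scan(&id, &payload); err != nil {
        return 0, nil, err // sql.ErrNoRows means the queue is currently empty
    }

    // Hide the row for the visibility timeout instead of deleting it, so a crashed
    // consumer's job reappears and is retried (dead-lettering omitted here).
    if _, err := tx.ExecContext(ctx,
        `UPDATE queue_jobs SET visible_at = now() + interval '30 seconds' WHERE id = $1`, id); err != nil {
        return 0, nil, err
    }
    return id, payload, tx.Commit()
}
```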
Platform Workflow:
- Deploy: Application pushes config to Prism control plane
- Validate: Proxy validates requirements are satisfiable
- Provision: Backends auto-provisioned (or mapped to existing)
- Observe: Namespace metrics tracked, capacity adjusted automatically
Benefits:
- ✅ Self-service (no ops ticket required)
- ✅ Version controlled (infrastructure as code in app repo)
- ✅ Testable (use MemStore in dev, Redis in production)
- ✅ Evolvable (add new `needs` fields without breaking changes)
Feature 4: Local-First Testing Strategy
Goal: Developers run full Prism stack on laptop with zero cloud dependencies.
Architecture (ADR-004):
# Development workflow
make dev-up # Start Prism proxy + MemStore (in-process, instant)
make test # Run tests against local MemStore
make dev-down # Stop everything
# Integration testing
make integration-up # Start testcontainers (Redis, Postgres, NATS, Kafka)
make integration-test # Run full test suite against real backends
make integration-down # Cleanup
Test Pyramid:
┌────────────────┐
│ E2E Tests │ Kubernetes, full stack
│ (10 tests) │ Runtime: 5 minutes
└────────────────┘
┌──────────────────┐
│ Integration Tests│ Testcontainers (Redis, Postgres)
│ (100 tests) │ Runtime: 2 minutes
└──────────────────┘
┌──────────────────────┐
│ Unit Tests │ MemStore (in-memory, no containers)
│ (1000 tests) │ Runtime: 10 seconds
└──────────────────────┘
Backend Substitutability:
Same test suite runs against multiple backends:
// Interface-based acceptance tests
backendDrivers := []BackendSetup{
    {Name: "MemStore", Setup: setupMemStore, SupportsTTL: true},
    {Name: "Redis", Setup: setupRedis, SupportsTTL: true},
    {Name: "Postgres", Setup: setupPostgres, SupportsTTL: false},
}

for _, backend := range backendDrivers {
    t.Run(backend.Name, func(t *testing.T) {
        driver, cleanup := backend.Setup(t)
        defer cleanup()

        // Same test code for all backends
        testKeyValueBasicOperations(t, driver)
        if backend.SupportsTTL {
            testKeyValueTTL(t, driver)
        }
    })
}
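The setupRedis helper referenced above could be backed by testcontainers-go. The sketch below is simplified (it returns a connection address and cleanup function rather than a Prism driver), and the image tag is an assumption:

```go
package acceptance

import (
    "context"
    "testing"

    "github.com/testcontainers/testcontainers-go"
    "github.com/testcontainers/testcontainers-go/wait"
)

// setupRedisContainer starts a throwaway Redis for integration tests and returns
// its host:port plus a cleanup function for use with defer.
func setupRedisContainer(t *testing.T) (addr string, cleanup func()) {
    ctx := context.Background()
    container, err := testcontainers.GenericContainer(ctx, testcontainers.GenericContainerRequest{
        ContainerRequest: testcontainers.ContainerRequest{
            Image:        "redis:7-alpine",
            ExposedPorts: []string{"6379/tcp"},
            WaitingFor:   wait.ForListeningPort("6379/tcp"),
        },
        Started: true,
    })
    if err != nil {
        t.Fatalf("start redis container: %v", err)
    }
    host, err := container.Host(ctx)
    if err != nil {
        t.Fatalf("resolve container host: %v", err)
    }
    port, err := container.MappedPort(ctx, "6379/tcp")
    if err != nil {
        t.Fatalf("resolve mapped port: %v", err)
    }
    return host + ":" + port.Port(), func() { _ = container.Terminate(ctx) }
}
```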
Developer Experience:
- ✅ Unit tests run in <10 seconds (MemStore is instant)
- ✅ Integration tests run in <2 minutes (testcontainers)
- ✅ CI/CD fails fast (no waiting for cloud resources)
- ✅ Deterministic (no flaky tests from network/cloud issues)
Feature 5: Zero-Downtime Migrations
Goal: Change backends without application code changes or service interruptions.
Migration Patterns (ADR-031):
Pattern 1: Dual-Write (Postgres → DynamoDB example)
# Phase 1: Dual-write to both backends
namespace: user-profiles
migration:
  strategy: dual-write
  primary: postgres   # Reads from here
  shadow: dynamodb    # Writes to both, reads for comparison

# Phase 2: Switch primary (traffic cutover)
namespace: user-profiles
migration:
  strategy: dual-write
  primary: dynamodb   # Reads from here
  shadow: postgres    # Still writing to both

# Phase 3: Complete migration (remove shadow)
namespace: user-profiles
backend: dynamodb
Observability During Migration:
- Consistency diff percentage (shadow reads vs primary reads)
- Latency comparison (primary vs shadow)
- Error rates per backend
- Data completeness metrics
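One way the consistency diff could be computed without adding latency to the read path is sketched below; the Reader interface and DiffStats type are hypothetical, not Prism's actual internals:

```go
package dualwrite

import (
    "bytes"
    "context"
    "sync/atomic"
)

// Reader is the minimal read capability both backends are assumed to expose.
type Reader interface {
    Get(ctx context.Context, key string) ([]byte, error)
}

// DiffStats feeds the "consistency diff percentage" signal during a migration.
type DiffStats struct {
    compared atomic.Int64
    mismatch atomic.Int64
}

func (s *DiffStats) MismatchPercent() float64 {
    c := s.compared.Load()
    if c == 0 {
        return 0
    }
    return 100 * float64(s.mismatch.Load()) / float64(c)
}

// CompareRead serves the value from the primary, then compares it against the
// shadow backend asynchronously so clients never pay for the shadow read.
func CompareRead(ctx context.Context, primary, shadow Reader, key string, stats *DiffStats) ([]byte, error) {
    value, err := primary.Get(ctx, key)
    if err != nil {
        return nil, err
    }
    go func() {
        shadowValue, shadowErr := shadow.Get(context.Background(), key)
        stats.compared.Add(1)
        if shadowErr != nil || !bytes.Equal(value, shadowValue) {
            stats.mismatch.Add(1)
        }
    }()
    return value, nil
}
```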
Pattern 2: Shadow Traffic (Kafka → NATS example)
# Phase 1: Shadow traffic to new backend
namespace: events
migration:
  strategy: shadow
  primary: kafka   # All production traffic
  shadow: nats     # Copy of traffic (metrics only)
# Observe: Validate NATS can handle load, latency acceptable

# Phase 2: Percentage cutover
namespace: events
migration:
  strategy: percentage
  backends:
    - nats: 10%    # 10% of traffic
    - kafka: 90%

# Phase 3: Full cutover
namespace: events
backend: nats
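The percentage strategy reduces to weighted backend selection per request. A minimal sketch, assuming a hypothetical WeightedBackend config type and at least one configured backend:

```go
package router

import "math/rand"

// WeightedBackend pairs a backend with its share of traffic, mirroring the
// percentage strategy above (e.g. nats: 10, kafka: 90).
type WeightedBackend struct {
    Name   string
    Weight int // percentage; weights are expected to sum to 100
}

// Pick selects a backend for one request in proportion to the configured weights,
// so cutover can be raised gradually (1% -> 10% -> 50% -> 100%).
func Pick(backends []WeightedBackend) string {
    total := 0
    for _, b := range backends {
        total += b.Weight
    }
    if total <= 0 {
        return backends[len(backends)-1].Name // degenerate config: fall back to last entry
    }
    n := rand.Intn(total)
    for _, b := range backends {
        n -= b.Weight
        if n < 0 {
            return b.Name
        }
    }
    return backends[len(backends)-1].Name
}
```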
Safety Guarantees:
- ✅ Automatic rollback on error rate spike
- ✅ Circuit breaker prevents cascading failures
- ✅ Data consistency validation before full cutover
- ✅ Application code unchanged throughout migration
Feature 6: Documentation-First Development
Goal: Design before implementation, preserve decisions permanently.
Workflow (MEMO-003):
┌──────────────────────────────────────────────────────┐
│ 1. Design Phase: Write RFC/ADR with diagrams │
│ - Mermaid sequence diagrams for flows │
│ - Code examples that compile │
│ - Trade-offs explicitly documented │
│ Duration: 1-2 days │
└──────────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────────┐
│ 2. Review Phase: Team feedback on design │
│ - Async review via GitHub PR │
│ - Live preview with Docusaurus (instant feedback) │
│ - Iterate on design (not code) │
│ Duration: 2-3 days │
└──────────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────────┐
│ 3. Implementation Phase: Code follows design │
│ - RFC is the spec (not implementation detail) │
│ - Tests match documented examples │
│ - Zero design rework │
│ Duration: 5-7 days │
└──────────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────────┐
│ 4. Validation Phase: Verify code matches docs │
│ - Link PRs to RFCs │
│ - Update docs if implementation diverged │
│ - Maintain living documentation │
└──────────────────────────────────────────────────────┘
Documentation Types:
Type | Purpose | Example |
---|---|---|
ADR (Architecture Decision Record) | Why we made significant architectural choices | ADR-001: Rust for Proxy |
RFC (Request for Comments) | Complete technical specification for features | RFC-010: Admin Protocol |
MEMO | Analysis, reviews, process improvements | MEMO-004: Backend Implementation Guide |
Micro-CMS Advantage:
Prism uses Docusaurus + GitHub Pages as a "micro-CMS":
- ✅ Rendered Mermaid diagrams (understand flows instantly)
- ✅ Syntax-highlighted code examples (copy-paste ready)
- ✅ Full-text search (find answers in seconds)
- ✅ Cross-referenced knowledge graph (ADRs ↔ RFCs ↔ MEMOs)
- ✅ Live preview (see changes in <1 second)
- ✅ Professional appearance (builds trust with stakeholders)
Impact:
- Design flaws caught before implementation (cost: 1 hour to fix RFC vs 1 week to refactor code)
- New team members productive in <1 week (read docs, not code)
- Decisions preserved permanently (no tribal knowledge loss)
Technical Requirements
Performance Requirements
Metric | Target | Measurement Method |
---|---|---|
Latency (p50) | <1ms | End-to-end client → proxy → backend → client |
Latency (p99) | <10ms | Excludes backend latency (measure proxy overhead) |
Latency (p99.9) | <50ms | With load shedding and circuit breaking active |
Throughput | 100K+ RPS | Single proxy instance on 4-core VM |
Concurrency | 10K+ connections | Simultaneous client connections per proxy |
Memory | <500MB baseline | Proxy memory usage at idle |
CPU | <30% at 50K RPS | Proxy CPU usage under load |
Rationale: Netflix's Java-based gateway achieves 8M+ QPS across its cluster. Prism targets 100K+ RPS per proxy instance (an estimated 10-100x per-instance efficiency gain) via Rust's zero-cost abstractions and the Tokio async runtime.
Reliability Requirements
Requirement | Target | Implementation |
---|---|---|
Availability | 99.99% (52 min downtime/year) | Multi-region deployment, health checks, auto-restart |
Circuit Breaking | Trip after 5 consecutive failures | Per-backend circuit breaker, 30s recovery window |
Load Shedding | Shed requests at 90% capacity | Priority-based queuing, graceful degradation |
Failover | <5s to switch to replica | Automatic health-check-based failover |
Data Durability | Zero message loss (Queue pattern) | At-least-once delivery, persistent queue backends |
Consistency | Configurable (eventual → strong) | Per-namespace consistency level declaration |
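For illustration, the per-backend circuit breaker could take the shape below (a consecutive-failure threshold plus a fixed recovery window, with half-open probing omitted). Type and field names are assumptions, not the proxy's actual implementation:

```go
package resilience

import (
    "errors"
    "sync"
    "time"
)

var ErrCircuitOpen = errors.New("circuit open: backend temporarily bypassed")

// Breaker trips after Threshold consecutive failures and rejects calls until the
// Recovery window elapses, matching the defaults above (5 failures, 30s).
type Breaker struct {
    mu        sync.Mutex
    failures  int
    openUntil time.Time

    Threshold int           // e.g. 5
    Recovery  time.Duration // e.g. 30 * time.Second
}

// Do runs fn unless the breaker is open, and records the outcome.
func (b *Breaker) Do(fn func() error) error {
    b.mu.Lock()
    if time.Now().Before(b.openUntil) {
        b.mu.Unlock()
        return ErrCircuitOpen
    }
    b.mu.Unlock()

    err := fn()

    b.mu.Lock()
    defer b.mu.Unlock()
    if err != nil {
        b.failures++
        if b.failures >= b.Threshold {
            b.openUntil = time.Now().Add(b.Recovery)
            b.failures = 0
        }
        return err
    }
    b.failures = 0
    return nil
}
```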
Security Requirements
Requirement | Implementation | Reference |
---|---|---|
Authentication | OIDC (OAuth2) for admin API, mTLS for data plane | RFC-010, ADR-046 |
Authorization | Namespace-based RBAC, OPA integration | RFC-011 |
Encryption | TLS 1.3 for all communication | ADR-047 |
Audit Logging | All data access logged with user context | RFC-010 |
PII Handling | Automatic encryption/masking via proto tags | ADR-003 |
Secrets Management | HashiCorp Vault integration | Planned Q2 2026 |
Observability Requirements
Signal | Collection Method | Storage | Retention |
---|---|---|---|
Metrics | OpenTelemetry (Prometheus format) | Local Signoz instance (dev), Prometheus (prod) | 90 days |
Traces | OpenTelemetry (OTLP) | Signoz (dev), Jaeger (prod) | 30 days |
Logs | Structured JSON (slog) | Signoz (dev), Loki (prod) | 14 days |
Profiles | pprof (Go plugins), perf (Rust proxy) | S3 (long-term) | 7 days active |
Key Metrics to Track:
- Request rate (RPS) per namespace
- Latency histogram per namespace per backend
- Error rate per namespace per backend
- Backend health (up/down, latency, capacity)
- Cache hit rate (if caching enabled)
- Migration progress (dual-write consistency %)
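A sketch of how a Go backend plugin might record the per-namespace, per-backend latency histogram using the OpenTelemetry metrics API; the instrument and attribute names are illustrative, not a defined Prism metrics schema:

```go
package telemetry

import (
    "context"
    "time"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/metric"
)

var meter = otel.Meter("prism.proxy")

// Instrument creation error is ignored here for brevity; production code should check it.
var requestDuration, _ = meter.Float64Histogram(
    "prism.request.duration",
    metric.WithUnit("ms"),
    metric.WithDescription("Request latency per namespace and backend"),
)

// RecordRequest tags each observation with namespace, backend, and error status,
// which is enough to derive the rate, latency, and error-rate metrics listed above.
func RecordRequest(ctx context.Context, namespace, backend string, start time.Time, err error) {
    elapsedMs := float64(time.Since(start).Microseconds()) / 1000.0
    requestDuration.Record(ctx, elapsedMs,
        metric.WithAttributes(
            attribute.String("namespace", namespace),
            attribute.String("backend", backend),
            attribute.Bool("error", err != nil),
        ))
}
```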
Scalability Requirements
Dimension | Target | Strategy |
---|---|---|
Namespaces | 1,000+ per proxy | Namespace isolation, lightweight routing |
Backends | 100+ unique backend instances | Plugin architecture, lazy loading |
Clients | 10,000+ concurrent clients | Connection pooling, multiplexing |
Message Size | Up to 5GB (via Claim Check) | Automatic large payload handling |
Retention | 30+ days (streams, queues) | Backend-native retention policies |
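The Claim Check row above implies that oversized payloads are parked in object storage while only a small reference travels through the queue. A minimal sketch, with an assumed ObjectStore abstraction standing in for S3/MinIO:

```go
package claimcheck

import (
    "context"

    "github.com/google/uuid"
)

// ObjectStore abstracts the blob backend (e.g. S3/MinIO); these methods are assumptions.
type ObjectStore interface {
    Put(ctx context.Context, key string, data []byte) error
    Get(ctx context.Context, key string) ([]byte, error)
}

// Claim is the small reference enqueued in place of the payload itself.
type Claim struct {
    Key  string `json:"key"`
    Size int    `json:"size"`
}

// Check stores an oversized payload and returns the claim to enqueue; producers
// would typically only do this above some inline size threshold (e.g. 1 MiB).
func Check(ctx context.Context, store ObjectStore, payload []byte) (*Claim, error) {
    key := "claims/" + uuid.NewString()
    if err := store.Put(ctx, key, payload); err != nil {
        return nil, err
    }
    return &Claim{Key: key, Size: len(payload)}, nil
}

// Redeem fetches the payload back on the consumer side using the claim key.
func Redeem(ctx context.Context, store ObjectStore, c *Claim) ([]byte, error) {
    return store.Get(ctx, c.Key)
}
```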
Success Metrics
Product Adoption (Primary Metric)
Goal: 80% of internal microservices use Prism within 12 months of GA
Measurement:
- Number of namespaces created
- Number of unique applications using Prism
- RPS through Prism vs direct backend access
- % of new services that use Prism (target: 100%)
Milestone Targets:
- Month 3: 10 early adopters (friendly teams)
- Month 6: 30% of internal services
- Month 9: 60% of internal services
- Month 12: 80% of internal services
Developer Productivity
Goal: 50% reduction in time-to-production for new services
Metric | Before Prism | With Prism | Measurement |
---|---|---|---|
Time to Production | 4-6 weeks | 1-2 weeks | Repo created → first production request |
Platform Tickets | Baseline | 50% reduction | Monthly support ticket volume |
Developer Satisfaction | N/A | >80% "would recommend" | Quarterly survey |
Operational Efficiency
Goal: Reduce data access incidents by 50%, MTTR <15 minutes
Metric | Baseline | Target (6 months) | Measurement |
---|---|---|---|
Incidents | 20/month | 10/month | Data access-related incidents |
MTTR | Variable | <15 minutes | Mean time to resolution |
On-Call Pages | Baseline | 50% reduction | Backend-related pages |
Migration Velocity
Goal: Enable 3+ backend migrations/year with zero application code changes
Phase | Timeline | Target | Measurement |
---|---|---|---|
Phase 1 | 2026 | 1 migration | Redis → DynamoDB |
Phase 2 | 2027+ | 3+ migrations/year | Successful cutover count |
Code Changes | All phases | Zero | Lines of application code changed |
Performance
Goal: p99 latency <10ms, 100K+ RPS per instance
Measurement:
- Latency histogram (p50, p95, p99, p99.9)
- Throughput (RPS) per instance
- Resource utilization (CPU, memory)
Target:
- p99 latency: <10ms (excluding backend latency)
- Throughput: 100K RPS (4-core VM)
- CPU: <30% at 50K RPS
Release Phases
Phase 0: POC Validation (Q4 2025 - Q1 2026) ✅ In Progress
Goal: Prove core architecture with minimal scope
Deliverables:
- POC 1: KeyValue pattern with MemStore (2 weeks) → RFC-018
- POC 2: KeyValue pattern with Redis (2 weeks)
- POC 3: PubSub pattern with NATS (2 weeks)
- POC 4: Multicast Registry pattern (3 weeks)
- POC 5: Authentication (Admin Protocol with OIDC) (2 weeks)
Success Criteria:
- All POCs demonstrate end-to-end flow (client → proxy → backend)
- Performance targets met (p99 <10ms, 100K RPS)
- Tests pass against multiple backends (MemStore, Redis, NATS)
Status: POCs 1-3 completed; POCs 4 and 5 in progress
Phase 1: Alpha Release (Q2 2026)
Goal: Internal dogfooding with friendly teams
Scope:
- ✅ Layer 1 Primitives: KeyValue, PubSub, Queue
- ✅ Backends: MemStore, Redis, Postgres, NATS, Kafka
- ✅ Admin API: Namespace CRUD, health checks, metrics
- ✅ Client SDKs: Python, Go, Rust
- ❌ Layer 2 Patterns: Not included (only primitives)
- ❌ Migrations: Not supported yet
Target Users: 5-10 internal early adopter teams
Success Criteria:
- 10 namespaces in production
- 10K+ RPS sustained
- Zero critical bugs for 2 consecutive weeks
- Developer feedback: "would recommend" >80%
Risk Mitigation:
- Feature flags for gradual rollout
- Shadow traffic only (no primary traffic yet)
- 24/7 on-call support for early adopters
Phase 2: Beta Release (Q3 2026)
Goal: Production-ready for core use cases
Scope:
- ✅ All Phase 1 features
- ✅ Layer 2 Patterns: Multicast Registry, Cache Aside
- ✅ Migrations: Dual-write pattern
- ✅ Observability: Full OpenTelemetry integration
- ✅ Security: OIDC authentication, RBAC authorization
- ❌ Advanced patterns (Saga, Event Sourcing): Not yet
Target Users: 30% of internal services (~50 services)
Success Criteria:
- 100+ namespaces in production
- 500K+ RPS sustained
- 99.9% availability (month-over-month)
- 1 successful migration (Redis → DynamoDB)
Marketing:
- Internal tech talks (bi-weekly)
- Comprehensive documentation site
- Getting started guides and templates
- Office hours (weekly)
Phase 3: GA Release (Q4 2026)
Goal: General availability for all internal teams
Scope:
- ✅ All Phase 2 features
- ✅ Layer 2 Patterns: Saga, Event Sourcing, Work Queue
- ✅ Backends: All planned backends (8 total)
- ✅ Advanced migrations: Shadow traffic, percentage cutover
- ✅ Self-service: Namespace creation via GitOps
Target Users: 80% of internal services (~200 services)
Success Criteria:
- 500+ namespaces in production
- 5M+ RPS sustained
- 99.99% availability (quarterly)
- 3 successful migrations
Support:
- SLA-backed support (8x5 initially, 24x7 by Q1 2027)
- Dedicated Slack channel
- Runbook for common issues
- Incident response plan
Phase 4: Ecosystem Growth (2027+)
Goal: Become the default data access layer
Scope:
- ✅ External backends: AWS (DynamoDB, S3, SQS), GCP (Datastore, Pub/Sub)
- ✅ Community patterns: 3rd-party contributed patterns
- ✅ Client SDKs: Java, TypeScript, C#
- ✅ Integrations: Kubernetes Operator, Terraform Provider, Helm Charts
Target Users: 100% of internal services + select external partners
Success Criteria:
- 1,000+ namespaces
- 10M+ RPS sustained
- 99.99% availability (SLA-backed)
- 5+ community-contributed backend plugins
- 10+ community-contributed patterns
Ecosystem:
- Open-source core proxy
- Plugin marketplace
- Pattern certification program
- Annual user conference
Risks and Mitigations
Risk 1: Adoption Resistance (High)
Risk: Teams prefer using backends directly (fear of abstraction overhead)
Mitigation:
- ✅ Prove performance: Publish benchmarks showing <1ms overhead
- ✅ Early wins: Work with friendly teams, showcase success stories
- ✅ Incremental adoption: Allow hybrid (some namespaces via Prism, some direct)
- ✅ Developer experience: Make Prism easier than direct integration (generators, templates)
Ownership: Product Manager + Developer Relations
Risk 2: Performance Bottleneck (Medium)
Risk: Proxy becomes bottleneck at scale (CPU, memory, network)
Mitigation:
- ✅ Rust performance: Leverage zero-cost abstractions, async runtime
- ✅ Benchmarking: Continuous performance regression testing
- ✅ Horizontal scaling: Stateless proxy, easy to scale out
- ✅ Bypass mode: Critical paths can bypass proxy if needed
Ownership: Performance Engineer + SRE
Risk 3: Backend-Specific Features (Medium)
Risk: Teams need backend-specific features not abstracted by Prism
Mitigation:
- ✅ Layer 1 escape hatch: Low-level primitives allow direct control
- ✅ Backend-specific extensions: Optional proto extensions per backend
- ✅ Passthrough mode: Raw query mode for specialized cases
- ✅ Feedback loop: Prioritize frequently requested features
Ownership: Platform Engineer + Product Manager
Risk 4: Migration Complexity (High)
Risk: Dual-write and shadow traffic patterns introduce data consistency issues
Mitigation:
- ✅ Consistency validation: Automated diff detection and alerting
- ✅ Rollback plan: Instant rollback on error rate spike
- ✅ Gradual rollout: Percentage cutover (1% → 10% → 50% → 100%)
- ✅ Dry-run mode: Test migration without impacting production
Ownership: SRE + Database Engineer
Risk 5: Operational Complexity (Medium)
Risk: Prism adds another component to debug, increasing operational burden
Mitigation:
- ✅ Centralized observability: All signals (metrics, traces, logs) in one place
- ✅ Health checks: Automated detection and remediation
- ✅ Runbooks: Comprehensive troubleshooting guides
- ✅ Self-healing: Automatic restarts, circuit breaking, load shedding
Ownership: SRE + DevOps
Open Questions
Question 1: Should Layer 2 Patterns Be Open-Sourced?
Context: Layer 1 (primitives) are generic and reusable. Layer 2 (patterns) may encode internal business logic.
Options:
- Option A: Open-source all patterns (maximum community value)
- Option B: Open-source generic patterns only (Multicast Registry, Saga), keep business-specific private
- Option C: All patterns internal initially, evaluate open-source later
Recommendation: Option B (selective open-source) - generic patterns have broad applicability, while business-specific patterns stay internal
Decision Needed By: Q2 2026 (before Beta release)
Question 2: What is the Pricing Model (If External)?
Context: If Prism is offered as managed service to external customers, what pricing model makes sense?
Options:
- Option A: RPS-based (per million requests)
- Option B: Namespace-based (per active namespace)
- Option C: Resource-based (CPU/memory allocation)
- Option D: Free tier + enterprise support
Recommendation: Start with internal-only (no pricing), evaluate external offering in 2027
Decision Needed By: Q4 2026 (if external offering considered)
Question 3: How Do We Handle Schema Evolution?
Context: Protobuf schemas will evolve (new fields, deprecated methods). How do we maintain compatibility?
Options:
- Option A: Strict versioning (v1, v2 incompatible)
- Option B: Backward-compatible only (always additive)
- Option C: API versioning per namespace (clients pin versions)
Recommendation: Option B + C hybrid (backward-compatible by default, namespaces can pin versions)
Decision Needed By: Q1 2026 (before Alpha)
Appendix
Competitive Landscape
Product | Approach | Strengths | Weaknesses | Differentiation |
---|---|---|---|---|
Netflix Data Gateway | JVM-based proxy | Battle-tested at scale | Proprietary, JVM overhead | Rust performance, local-first testing |
AWS AppSync | Managed GraphQL | Serverless, fully managed | AWS-only, GraphQL-specific | Multi-cloud, gRPC/HTTP APIs |
Hasura | GraphQL over Postgres | Instant GraphQL API | Postgres-only initially | Multi-backend, pattern library |
Kong / Envoy | API Gateway | HTTP/gRPC proxy | No data abstraction | Data-aware patterns (not just routing) |
Direct SDK | Client libraries | No additional hop | Tight coupling, hard to migrate | Loose coupling, easy migrations |
Prism's Unique Value:
- Performance: Rust-based, 10-100x better than JVM alternatives
- Flexibility: Works with any backend (not locked to AWS/Postgres)
- Patterns: High-level abstractions (not just API gateway)
- Local-First: Full stack runs on laptop (not just cloud)
References
Netflix Data Gateway:
- Netflix Index - Overview and key learnings
- Netflix Summary - Lessons learned
- Netflix Abstractions - Data models (KeyValue, TimeSeries, Counter, WAL)
- Netflix Key Use Cases - Real-world applications
Prism Architecture:
- ADR-001: Rust for Proxy - Why Rust over Go/Java
- RFC-008: Proxy Plugin Architecture - Backend plugin system
- RFC-014: Layered Data Access Patterns - Layer 1 primitives
- RFC-017: Multicast Registry Pattern - First Layer 2 pattern
- RFC-018: POC Implementation Strategy - Phased rollout plan
Design Philosophy:
- MEMO-003: Documentation-First Development - Design before code
- MEMO-004: Backend Plugin Implementation Guide - Backend priorities
- MEMO-005: Client Protocol Design Philosophy - Layered API architecture
- MEMO-006: Backend Interface Decomposition - Schema registry
Revision History
- 2025-10-12: Initial PRD based on Netflix learnings and Prism architecture memos
- Future: Updates as product evolves
Approvals
Product Owner: [Name] - Approved [Date]
Engineering Lead: [Name] - Approved [Date]
Architecture Review: [Name] - Approved [Date]