
PRD-001: Prism Data Access Gateway

Executive Summary

Prism is a high-performance data access gateway that provides unified APIs for heterogeneous backend datastores, enabling application developers to focus on business logic while the platform handles data access complexity, migrations, and operational concerns.

Inspired by Netflix's Data Gateway, Prism adopts proven patterns from Netflix's 8M+ QPS, 3,500+ use-case platform while improving performance (10-100x via Rust), developer experience (client-originated configuration), and operational simplicity (local-first testing, flexible deployment).

Target Launch: Q2 2026 (Phase 0 POCs complete by Q1 2026)

Success Metric: 80% of internal microservices use Prism for data access within 12 months of GA


Product Vision

The Problem: Data Access Complexity at Scale

Modern microservices architectures face growing data access challenges:

  1. API Fragmentation: Each datastore (Redis, Postgres, Kafka, DynamoDB) has unique APIs, client libraries, and operational requirements
  2. Migration Complexity: Changing backends requires rewriting application code, extensive testing, and risky deployments
  3. Distributed Systems Knowledge Gap: Most application developers shouldn't need expertise in consistency models, partitioning, replication, and distributed transactions
  4. Operational Burden: Each backend requires separate monitoring, capacity planning, security configuration, and disaster recovery
  5. Pattern Reimplementation: Common patterns (outbox, claim check, sagas) are reimplemented inconsistently across teams

The Solution: Unified Data Access Layer

Prism provides abstraction without compromise:

  • Unified APIs: Single set of gRPC/HTTP APIs for KeyValue, PubSub, Queue, TimeSeries, Graph, and Document access patterns
  • Backend Agnostic: Application code unchanged when switching from Redis to DynamoDB, or Kafka to NATS
  • Semantic Guarantees: Patterns like Multicast Registry coordinate multiple backends atomically
  • High Performance: Rust-based proxy achieves sub-millisecond p99 latency even at 100K+ RPS
  • Zero-Downtime Migrations: Shadow traffic and dual-write patterns enable gradual backend changes
  • Operational Simplicity: Centralized monitoring, security, and capacity management

Strategic Goals

  1. Accelerate Development: Reduce time-to-production for new services by 50% (eliminate backend integration work)
  2. Enable Migrations: Support 3+ major backend migrations per year with zero application code changes
  3. Reduce Operational Cost: Consolidate backend expertise, reduce redundant tooling, optimize resource utilization
  4. Improve Reliability: Provide battle-tested patterns, circuit breaking, load shedding, and failover built-in
  5. Foster Innovation: Allow teams to experiment with new backends without rewriting applications

Market Context

Netflix Data Gateway Learnings

Netflix's Data Gateway serves as our primary inspiration:

Scale Achievements:

  • 8M+ queries per second (key-value abstraction)
  • 10M+ writes per second (time-series data)
  • 3,500+ use cases across the organization
  • Petabyte-scale storage with low-latency retrieval

Key Lessons Adopted (Netflix Index):

| Netflix Lesson | Prism Implementation |
|---|---|
| Abstraction Simplifies Scale | Layer 1: Primitives (KeyValue, PubSub, Queue, TimeSeries, Graph, Document) |
| Prioritize Reliability | Circuit breaking, load shedding, failover built-in (ADR-029) |
| Data Management Critical | TTL, lifecycle policies, tiering strategies first-class (RFC-014) |
| Sharding for Isolation | Namespace-based isolation, per-tenant deployments (ADR-034) |
| Zero-Downtime Migrations | Shadow traffic, dual-write patterns, phased cutover (ADR-031) |

Prism's Improvements Over Netflix

| Aspect | Netflix Approach | Prism Enhancement | Benefit |
|---|---|---|---|
| Proxy Layer | JVM-based gateway | Rust-based (Tokio + Tonic) | 10-100x performance, lower resource usage |
| Configuration | Runtime deployment configs | Client-originated (apps declare needs) | Self-service, reduced ops toil |
| Testing | Production-validated | Local-first (sqlite, testcontainers) | Fast feedback, deterministic tests |
| Deployment | Kubernetes-native | Flexible (bare metal, VMs, containers) | Simpler operations, lower cost |
| Documentation | Internal wiki | Documentation-first (ADRs, RFCs, micro-CMS) | Faster onboarding, knowledge preservation |

User Personas

Primary: Application Developer (Backend Engineer)

Profile: Mid-level engineer building microservices, 2-5 years experience, proficient in one language (Go, Python, Rust, Java)

Goals:

  • Build features quickly without learning distributed systems internals
  • Use familiar patterns (REST APIs, pub/sub messaging) without backend-specific knowledge
  • Deploy code confidently without breaking production
  • Understand system behavior when things go wrong

Pain Points:

  • Overwhelmed by backend options (Redis, Postgres, Kafka, DynamoDB, Cassandra...)
  • Spending weeks integrating with each new datastore
  • Fear of making wrong architectural decisions early
  • Debugging distributed systems issues without proper training

Prism Value:

  • ✅ Single API for all data access (learn once, use everywhere)
  • ✅ Start with MemStore (in-memory), migrate to Redis/Postgres later without code changes
  • ✅ Pattern library provides proven solutions (Multicast Registry, Saga, Event Sourcing)
  • ✅ Rich error messages and observability built-in

Success Metric: Time from "new service" to "production" < 1 week


Secondary: Platform Engineer (Infrastructure Team)

Profile: Senior engineer responsible for platform services, 5-10 years experience, deep distributed systems knowledge

Goals:

  • Provide self-service capabilities to application teams
  • Maintain platform stability (high availability, low latency)
  • Manage cost and capacity efficiently
  • Enable safe migrations and experiments

Pain Points:

  • Supporting N different datastores with N different operational models
  • Manual work for every new namespace or backend instance
  • Risk of cascading failures from misbehaving applications
  • Difficult to enforce best practices (circuit breaking, retries, timeouts)

Prism Value:

  • ✅ Centralized observability and operational controls
  • ✅ Policy enforcement (rate limiting, access control, data governance)
  • ✅ Self-service namespace creation via declarative config
  • ✅ Backend substitutability (migrate Redis → DynamoDB transparently)

Success Metric: Operational incidents reduced by 50%, MTTR < 15 minutes


Tertiary: Data Engineer / Analyst

Profile: Specialist working with analytics, ML pipelines, or data warehousing

Goals:

  • Access production data for analytics safely
  • Build ETL pipelines without impacting production services
  • Integrate with existing analytics tools (Spark, Airflow, Snowflake)

Pain Points:

  • Direct database access risks impacting production
  • Inconsistent data formats across microservices
  • Difficult to maintain data lineage and quality

Prism Value:

  • ✅ Read-only replicas and Change Data Capture (CDC) support
  • ✅ TimeSeries abstraction for metrics and event logs
  • ✅ Graph abstraction for relationship queries
  • ✅ Audit trails and data provenance built-in

Success Metric: Analytics queries don't impact production latency


Core Features

Feature 1: Layered API Architecture

Layer 1: Primitives (Always Available)

Six foundational abstractions that compose to solve 80% of use cases:

| Primitive | Purpose | Backend Examples | RFC Reference |
|---|---|---|---|
| KeyValue | Simple storage | Redis, DynamoDB, etcd, Postgres, MemStore | RFC-014 |
| PubSub | Fire-and-forget messaging | NATS, Redis, Kafka (as topic) | RFC-014 |
| Queue | Work distribution | SQS, Postgres, RabbitMQ | RFC-014 |
| Stream | Ordered event log | Kafka, NATS JetStream, Redis Streams | RFC-014 |
| TimeSeries | Temporal data | ClickHouse, TimescaleDB, Prometheus | RFC-014 |
| Graph | Relationships | Neptune, Neo4j, Postgres (recursive CTEs) | RFC-014 |

Example Usage (Primitives):

from prism import Client

client = Client(endpoint="localhost:8080")

# KeyValue: Simple storage
await client.keyvalue.set("user:123", user_data, ttl=3600)
user = await client.keyvalue.get("user:123")

# PubSub: Broadcast events
await client.pubsub.publish("user-events", event_data)
async for event in client.pubsub.subscribe("user-events"):
    process(event)

# Queue: Background jobs
await client.queue.enqueue("email-jobs", email_task)
task = await client.queue.dequeue("email-jobs", visibility_timeout=30)

Layer 2: Patterns (Use-Case-Specific, Opt-In)

Purpose-built patterns that coordinate multiple backends for common use cases:

| Pattern | Solves | Composes | RFC Reference |
|---|---|---|---|
| Multicast Registry | Device management, presence, service discovery | KeyValue + PubSub + Queue | RFC-017 |
| Saga | Distributed transactions | KeyValue + Queue + Compensation | Planned Q2 2026 |
| Event Sourcing | Audit trails, event replay | Stream + KeyValue + Snapshots | Planned Q2 2026 |
| Cache Aside | Read-through caching | KeyValue (cache) + KeyValue (db) | Planned Q3 2026 |
| Outbox | Transactional messaging | KeyValue (tx) + Queue + WAL | Planned Q3 2026 |

Example Usage (Patterns):

# Multicast Registry: IoT device management
registry = client.multicast_registry("iot-devices")

# Register device with metadata
await registry.register(
    identity="device-sensor-001",
    metadata={"type": "temperature", "location": "building-a", "floor": 3}
)

# Enumerate matching devices
devices = await registry.enumerate(filter={"location": "building-a"})

# Multicast command to filtered subset
result = await registry.multicast(
    filter={"type": "temperature", "floor": 3},
    message={"command": "read", "sample_rate": 5}
)
print(f"Delivered to {result.success_count}/{result.total_count} devices")

Why Layered? (MEMO-005)

  • Layer 1 for power users who need full control and novel compositions
  • Layer 2 for most developers who want ergonomic, self-documenting APIs
  • Choice based on team expertise and use case requirements

User Persona Mapping:

  • Application Developers: Primarily Layer 2 (80% of use cases)
  • Platform Engineers: Both layers (Layer 1 for infrastructure, Layer 2 for application teams)
  • Advanced Users: Layer 1 for custom patterns

Feature 2: Backend Plugin Architecture

Goal: Support 10+ backends without bloating core proxy

Architecture (RFC-008):

┌──────────────────────────────────────────────────────┐
│               Prism Proxy (Rust Core)                │
│  ┌────────────────────────────────────────────────┐  │
│  │   Layer 1 API: KeyValue, PubSub, Queue, etc.   │  │
│  └────────────────────────────────────────────────┘  │
│                          │                           │
│                          ↓ gRPC                      │
│  ┌────────────────────────────────────────────────┐  │
│  │        Namespace Router (config-driven)        │  │
│  └────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────┘

              ┌────────────┼────────────┐
              ↓            ↓            ↓
         ┌─────────┐  ┌─────────┐  ┌─────────┐
         │  Redis  │  │ Postgres│  │  Kafka  │
         │ Plugin  │  │ Plugin  │  │ Plugin  │
         │  (Go)   │  │  (Go)   │  │  (Go)   │
         └─────────┘  └─────────┘  └─────────┘

Backend Interface Decomposition (MEMO-006):

Instead of monolithic "Redis backend", each backend advertises thin interfaces:

# Redis implements 24 interfaces
backend: redis
implements:
  - keyvalue_basic          # Set, Get, Delete, Exists
  - keyvalue_scan           # Scan, ScanKeys, Count
  - keyvalue_ttl            # Expire, GetTTL, Persist
  - keyvalue_transactional  # MULTI/EXEC
  - keyvalue_batch          # MGET, MSET

  - pubsub_basic            # Publish, Subscribe
  - pubsub_wildcards        # Pattern matching

  - stream_basic            # XADD, XREAD
  - stream_consumer_groups  # XGROUP, XREADGROUP
  - stream_replay           # XREAD from offset

  # ...and 14 more (Lists, Sets, SortedSets)

Pattern Slot Matching:

Patterns declare the interfaces they require for each slot; the proxy validates those requirements at configuration time:

pattern: multicast-registry
slots:
  registry:
    required: [keyvalue_basic, keyvalue_scan]
    optional: [keyvalue_ttl]
    recommended: [redis, postgres, dynamodb, etcd]

  messaging:
    required: [pubsub_basic]
    optional: [pubsub_persistent]
    recommended: [nats, kafka, redis]
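
To make the slot-matching check concrete, the sketch below validates a binding in Python. The interface names mirror the YAML above, while the data structures and the validate_slots function are illustrative assumptions, not the proxy's actual (Rust) implementation:

# Hypothetical sketch of config-time slot validation (illustration only).
BACKEND_INTERFACES = {
    "redis":    {"keyvalue_basic", "keyvalue_scan", "keyvalue_ttl", "pubsub_basic"},
    "postgres": {"keyvalue_basic", "keyvalue_scan", "keyvalue_transactional"},
    "nats":     {"pubsub_basic", "pubsub_persistent"},
}

PATTERN_SLOTS = {
    "multicast-registry": {
        "registry":  {"required": {"keyvalue_basic", "keyvalue_scan"}},
        "messaging": {"required": {"pubsub_basic"}},
    },
}

def validate_slots(pattern: str, bindings: dict[str, str]) -> list[str]:
    """Return a list of errors; an empty list means the binding satisfies the pattern."""
    errors = []
    for slot, spec in PATTERN_SLOTS[pattern].items():
        backend = bindings.get(slot)
        provided = BACKEND_INTERFACES.get(backend, set())
        missing = spec["required"] - provided
        if missing:
            errors.append(f"slot '{slot}' ({backend}): missing {sorted(missing)}")
    return errors

# Example: Redis for the registry slot, NATS for messaging -> no errors
print(validate_slots("multicast-registry", {"registry": "redis", "messaging": "nats"}))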

Backend Priority (MEMO-004):

Phase 1 (Internal Priorities):

  0. MemStore (in-memory Go map) - Score: 100/100 - Zero dependencies, instant startup
  1. Kafka - Score: 78/100 - Internal event streaming
  2. NATS - Score: 90/100 - Internal pub/sub messaging
  3. PostgreSQL - Score: 93/100 - Internal relational data
  4. Neptune - Score: 50/100 - Internal graph data

Phase 2 (External/Supporting):

  5. Redis - Score: 95/100 - General caching
  6. SQLite - Score: 92/100 - Embedded testing
  7. S3/MinIO - Score: 85/100 - Large payload handling
  8. ClickHouse - Score: 70/100 - Analytics


Feature 3: Client-Originated Configuration

Problem: Traditional approaches require ops teams to provision infrastructure before developers can code.

Prism Approach: Application declares requirements, platform provisions automatically.

Configuration Format:

# Application: prism.yaml (committed to app repo)
namespaces:
  - name: user-sessions
    pattern: keyvalue

    needs:
      latency: p99 < 10ms
      throughput: 50K rps
      ttl: required
      persistence: optional   # Can survive restarts

    backend:
      type: redis   # Explicit choice
      # OR
      auto: true    # Platform selects best match

  - name: notification-queue
    pattern: queue

    needs:
      visibility_timeout: 30s
      dead_letter: true
      throughput: 10K enqueues/sec

    backend:
      type: postgres   # Using Postgres as queue (SKIP LOCKED pattern)

Platform Workflow:

  1. Deploy: Application pushes config to Prism control plane
  2. Validate: Proxy validates requirements are satisfiable
  3. Provision: Backends auto-provisioned (or mapped to existing)
  4. Observe: Namespace metrics tracked, capacity adjusted automatically

Benefits:

  • ✅ Self-service (no ops ticket required)
  • ✅ Version controlled (infrastructure as code in app repo)
  • ✅ Testable (use MemStore in dev, Redis in production)
  • ✅ Evolvable (add needs fields without breaking changes)
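
Because the backend binding lives in prism.yaml rather than in code, the application itself stays identical across environments. A small illustrative sketch (the client API shape follows the earlier Python examples; the PRISM_ENDPOINT variable and key naming are assumptions, not a prescribed convention):

# Same application code whether the namespace is bound to MemStore (dev)
# or Redis (prod); only prism.yaml and the endpoint differ per environment.
import os
from prism import Client

client = Client(endpoint=os.getenv("PRISM_ENDPOINT", "localhost:8080"))

async def save_session(session_id: str, data: dict) -> None:
    # "user-sessions" resolves to whatever backend the namespace declares
    await client.keyvalue.set(f"session:{session_id}", data, ttl=3600)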

Feature 4: Local-First Testing Strategy

Goal: Developers run full Prism stack on laptop with zero cloud dependencies.

Architecture (ADR-004):

# Development workflow
make dev-up # Start Prism proxy + MemStore (in-process, instant)
make test # Run tests against local MemStore
make dev-down # Stop everything

# Integration testing
make integration-up # Start testcontainers (Redis, Postgres, NATS, Kafka)
make integration-test # Run full test suite against real backends
make integration-down # Cleanup

Test Pyramid:

              ┌────────────────┐
              │   E2E Tests    │  Kubernetes, full stack
              │   (10 tests)   │  Runtime: 5 minutes
              └────────────────┘
            ┌────────────────────┐
            │ Integration Tests  │  Testcontainers (Redis, Postgres)
            │    (100 tests)     │  Runtime: 2 minutes
            └────────────────────┘
          ┌────────────────────────┐
          │      Unit Tests        │  MemStore (in-memory, no containers)
          │     (1000 tests)       │  Runtime: 10 seconds
          └────────────────────────┘

Backend Substitutability:

Same test suite runs against multiple backends:

// Interface-based acceptance tests
func TestKeyValueAcceptance(t *testing.T) {
    backendDrivers := []BackendSetup{
        {Name: "MemStore", Setup: setupMemStore, SupportsTTL: true},
        {Name: "Redis", Setup: setupRedis, SupportsTTL: true},
        {Name: "Postgres", Setup: setupPostgres, SupportsTTL: false},
    }

    for _, backend := range backendDrivers {
        t.Run(backend.Name, func(t *testing.T) {
            driver, cleanup := backend.Setup(t)
            defer cleanup()

            // Same test code for all backends
            testKeyValueBasicOperations(t, driver)
            if backend.SupportsTTL {
                testKeyValueTTL(t, driver)
            }
        })
    }
}

Developer Experience:

  • ✅ Unit tests run in <10 seconds (MemStore is instant)
  • ✅ Integration tests run in <2 minutes (testcontainers)
  • ✅ CI/CD fails fast (no waiting for cloud resources)
  • ✅ Deterministic (no flaky tests from network/cloud issues)

Feature 5: Zero-Downtime Migrations

Goal: Change backends without application code changes or service interruptions.

Migration Patterns (ADR-031):

Pattern 1: Dual-Write (Postgres → DynamoDB example)

# Phase 1: Dual-write to both backends
namespace: user-profiles
migration:
  strategy: dual-write
  primary: postgres    # Reads from here
  shadow: dynamodb     # Writes to both, reads for comparison

# Phase 2: Switch primary (traffic cutover)
namespace: user-profiles
migration:
  strategy: dual-write
  primary: dynamodb    # Reads from here
  shadow: postgres     # Still writing to both

# Phase 3: Complete migration (remove shadow)
namespace: user-profiles
backend: dynamodb

Observability During Migration:

  • Consistency diff percentage (shadow reads vs primary reads)
  • Latency comparison (primary vs shadow)
  • Error rates per backend
  • Data completeness metrics
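
One way to picture the consistency-diff metric: sample keys, read each from the primary and the shadow backend, and report the mismatch percentage. The sketch below is illustrative; read_primary, read_shadow, and sample_keys are stand-ins, not Prism APIs:

# Hypothetical consistency-diff check during a dual-write migration.
async def consistency_diff(sample_keys, read_primary, read_shadow) -> float:
    mismatches = 0
    for key in sample_keys:
        primary_value = await read_primary(key)
        shadow_value = await read_shadow(key)
        if primary_value != shadow_value:
            mismatches += 1
    # Percentage of sampled keys that disagree between backends
    return 100.0 * mismatches / max(len(sample_keys), 1)

A diff percentage near 0% is one of the signals gating the Phase 2 cutover.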

Pattern 2: Shadow Traffic (Kafka → NATS example)

# Phase 1: Shadow traffic to new backend
namespace: events
migration:
  strategy: shadow
  primary: kafka    # All production traffic
  shadow: nats      # Copy of traffic (metrics only)

# Observe: Validate NATS can handle load, latency acceptable

# Phase 2: Percentage cutover
namespace: events
migration:
  strategy: percentage
  backends:
    - nats: 10%    # 10% of traffic
    - kafka: 90%

# Phase 3: Full cutover
namespace: events
backend: nats

Safety Guarantees:

  • ✅ Automatic rollback on error rate spike
  • ✅ Circuit breaker prevents cascading failures
  • ✅ Data consistency validation before full cutover
  • ✅ Application code unchanged throughout migration
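
As an illustration of the first guarantee, the rollback decision can be thought of as an error-rate comparison between the new backend and the established baseline. The sketch below is an assumption about the shape of that check, not the control plane's actual logic:

# Illustrative "rollback on error-rate spike" guard (thresholds are assumptions).
def should_rollback(shadow_errors: int, shadow_requests: int,
                    baseline_error_rate: float, max_ratio: float = 3.0) -> bool:
    if shadow_requests == 0:
        return False
    shadow_error_rate = shadow_errors / shadow_requests
    # Roll back if the new backend's error rate exceeds the baseline by max_ratio
    return shadow_error_rate > baseline_error_rate * max_ratio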

Feature 6: Documentation-First Development

Goal: Design before implementation, preserve decisions permanently.

Workflow (MEMO-003):

┌──────────────────────────────────────────────────────┐
│ 1. Design Phase: Write RFC/ADR with diagrams         │
│    - Mermaid sequence diagrams for flows             │
│    - Code examples that compile                      │
│    - Trade-offs explicitly documented                │
│    Duration: 1-2 days                                │
└──────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────┐
│ 2. Review Phase: Team feedback on design             │
│    - Async review via GitHub PR                      │
│    - Live preview with Docusaurus (instant feedback) │
│    - Iterate on design (not code)                    │
│    Duration: 2-3 days                                │
└──────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────┐
│ 3. Implementation Phase: Code follows design         │
│    - RFC is the spec (not implementation detail)     │
│    - Tests match documented examples                 │
│    - Zero design rework                              │
│    Duration: 5-7 days                                │
└──────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────┐
│ 4. Validation Phase: Verify code matches docs        │
│    - Link PRs to RFCs                                │
│    - Update docs if implementation diverged          │
│    - Maintain living documentation                   │
└──────────────────────────────────────────────────────┘

Documentation Types:

| Type | Purpose | Example |
|---|---|---|
| ADR (Architecture Decision Record) | Why we made significant architectural choices | ADR-001: Rust for Proxy |
| RFC (Request for Comments) | Complete technical specification for features | RFC-010: Admin Protocol |
| MEMO | Analysis, reviews, process improvements | MEMO-004: Backend Implementation Guide |

Micro-CMS Advantage:

Prism uses Docusaurus + GitHub Pages as a "micro-CMS":

  • ✅ Rendered Mermaid diagrams (understand flows instantly)
  • ✅ Syntax-highlighted code examples (copy-paste ready)
  • ✅ Full-text search (find answers in seconds)
  • ✅ Cross-referenced knowledge graph (ADRs ↔ RFCs ↔ MEMOs)
  • ✅ Live preview (see changes in <1 second)
  • ✅ Professional appearance (builds trust with stakeholders)

Impact:

  • Design flaws caught before implementation (cost: 1 hour to fix RFC vs 1 week to refactor code)
  • New team members productive in <1 week (read docs, not code)
  • Decisions preserved permanently (no tribal knowledge loss)

Technical Requirements

Performance Requirements

| Metric | Target | Measurement Method |
|---|---|---|
| Latency (p50) | <1ms | End-to-end client → proxy → backend → client |
| Latency (p99) | <10ms | Excludes backend latency (measures proxy overhead) |
| Latency (p99.9) | <50ms | With load shedding and circuit breaking active |
| Throughput | 100K+ RPS | Single proxy instance on 4-core VM |
| Concurrency | 10K+ connections | Simultaneous client connections per proxy |
| Memory | <500MB baseline | Proxy memory usage at idle |
| CPU | <30% at 50K RPS | Proxy CPU usage under load |

Rationale: Netflix's JVM-based gateway achieves 8M+ QPS across its cluster. Prism targets 100K+ RPS per instance (10-100x more efficient per instance) via Rust's zero-cost abstractions and the Tokio async runtime.

Reliability Requirements

| Requirement | Target | Implementation |
|---|---|---|
| Availability | 99.99% (52 min downtime/year) | Multi-region deployment, health checks, auto-restart |
| Circuit Breaking | Trip after 5 consecutive failures | Per-backend circuit breaker, 30s recovery window |
| Load Shedding | Shed requests at 90% capacity | Priority-based queuing, graceful degradation |
| Failover | <5s to switch to replica | Automatic health-check-based failover |
| Data Durability | Zero message loss (Queue pattern) | At-least-once delivery, persistent queue backends |
| Consistency | Configurable (eventual → strong) | Per-namespace consistency level declaration |
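
To illustrate the circuit-breaking row above (trip after 5 consecutive failures, 30-second recovery window), here is a minimal Python sketch of the semantics; the production implementation is part of the Rust proxy, so this is a behavioral illustration rather than its actual code:

import time

# Minimal sketch of per-backend circuit breaker semantics (illustration only).
class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_window: float = 30.0):
        self.failure_threshold = failure_threshold
        self.recovery_window = recovery_window
        self.consecutive_failures = 0
        self.opened_at = None  # timestamp when the breaker tripped

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open after the recovery window: allow a probe request through
        return time.monotonic() - self.opened_at >= self.recovery_window

    def record_success(self) -> None:
        self.consecutive_failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = time.monotonic()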

Security Requirements

| Requirement | Implementation | Reference |
|---|---|---|
| Authentication | OIDC (OAuth2) for admin API, mTLS for data plane | RFC-010, ADR-046 |
| Authorization | Namespace-based RBAC, OPA integration | RFC-011 |
| Encryption | TLS 1.3 for all communication | ADR-047 |
| Audit Logging | All data access logged with user context | RFC-010 |
| PII Handling | Automatic encryption/masking via proto tags | ADR-003 |
| Secrets Management | HashiCorp Vault integration | Planned Q2 2026 |

Observability Requirements

| Signal | Collection Method | Storage | Retention |
|---|---|---|---|
| Metrics | OpenTelemetry (Prometheus format) | Local Signoz instance (dev), Prometheus (prod) | 90 days |
| Traces | OpenTelemetry (OTLP) | Signoz (dev), Jaeger (prod) | 30 days |
| Logs | Structured JSON (slog) | Signoz (dev), Loki (prod) | 14 days |
| Profiles | pprof (Go plugins), perf (Rust proxy) | S3 (long-term) | 7 days active |

Key Metrics to Track:

  • Request rate (RPS) per namespace
  • Latency histogram per namespace per backend
  • Error rate per namespace per backend
  • Backend health (up/down, latency, capacity)
  • Cache hit rate (if caching enabled)
  • Migration progress (dual-write consistency %)
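
For instance, a client or plugin might record the per-namespace request, latency, and error signals with the OpenTelemetry Python API roughly as follows. This is a sketch; the metric and attribute names are illustrative, not Prism's actual metric schema:

# Recording request count and latency per namespace/backend via OpenTelemetry.
from opentelemetry import metrics

meter = metrics.get_meter("prism.client")

request_counter = meter.create_counter(
    "prism.requests", unit="1", description="Requests per namespace")
latency_histogram = meter.create_histogram(
    "prism.request.duration", unit="ms", description="End-to-end request latency")

attrs = {"namespace": "user-sessions", "backend": "redis", "status": "ok"}
request_counter.add(1, attrs)
latency_histogram.record(3.2, attrs)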

Scalability Requirements

| Dimension | Target | Strategy |
|---|---|---|
| Namespaces | 1,000+ per proxy | Namespace isolation, lightweight routing |
| Backends | 100+ unique backend instances | Plugin architecture, lazy loading |
| Clients | 10,000+ concurrent clients | Connection pooling, multiplexing |
| Message Size | Up to 5GB (via Claim Check) | Automatic large payload handling |
| Retention | 30+ days (streams, queues) | Backend-native retention policies |
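
The Claim Check row refers to the standard pattern of storing oversized payloads in object storage (e.g., S3/MinIO from the backend list) and passing only a reference through the queue. A minimal sketch, assuming hypothetical put_object/enqueue helpers rather than real Prism APIs:

# Illustrative claim-check flow for large payloads (Prism is described as
# handling this automatically; the helpers below are stand-ins).
import uuid

CLAIM_CHECK_THRESHOLD = 1 * 1024 * 1024  # 1 MiB, illustrative

async def publish_with_claim_check(queue, object_store, payload: bytes) -> None:
    if len(payload) <= CLAIM_CHECK_THRESHOLD:
        await queue.enqueue("jobs", {"inline": payload})
        return
    claim_id = f"claims/{uuid.uuid4()}"
    await object_store.put_object(claim_id, payload)        # e.g. S3/MinIO
    await queue.enqueue("jobs", {"claim_check": claim_id})  # small reference only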

Success Metrics

Product Adoption (Primary Metric)

Goal: 80% of internal microservices use Prism within 12 months of GA

Measurement:

  • Number of namespaces created
  • Number of unique applications using Prism
  • RPS through Prism vs direct backend access
  • % of new services that use Prism (target: 100%)

Milestone Targets:

  • Month 3: 10 early adopters (friendly teams)
  • Month 6: 30% of internal services
  • Month 9: 60% of internal services
  • Month 12: 80% of internal services

Developer Productivity

Goal: 50% reduction in time-to-production for new services

| Metric | Before Prism | With Prism | Measurement |
|---|---|---|---|
| Time to Production | 4-6 weeks | 1-2 weeks | Repo created → first production request |
| Platform Tickets | Baseline | 50% reduction | Monthly support ticket volume |
| Developer Satisfaction | N/A | >80% "would recommend" | Quarterly survey |

Operational Efficiency

Goal: Reduce data access incidents by 50%, MTTR <15 minutes

| Metric | Baseline | Target (6 months) | Measurement |
|---|---|---|---|
| Incidents | 20/month | 10/month | Data access-related incidents |
| MTTR | Variable | <15 minutes | Mean time to resolution |
| On-Call Pages | Baseline | 50% reduction | Backend-related pages |

Migration Velocity

Goal: Enable 3+ backend migrations/year with zero application code changes

| Phase | Timeline | Target | Measurement |
|---|---|---|---|
| Phase 1 | 2026 | 1 migration | Redis → DynamoDB |
| Phase 2 | 2027+ | 3+ migrations/year | Successful cutover count |
| Code Changes | All phases | Zero | Lines of application code changed |

Performance

Goal: p99 latency <10ms, 100K+ RPS per instance

Measurement:

  • Latency histogram (p50, p95, p99, p99.9)
  • Throughput (RPS) per instance
  • Resource utilization (CPU, memory)

Target:

  • p99 latency: <10ms (excluding backend latency)
  • Throughput: 100K RPS (4-core VM)
  • CPU: <30% at 50K RPS

Release Phases

Phase 0: POC Validation (Q4 2025 - Q1 2026) ✅ In Progress

Goal: Prove core architecture with minimal scope

Deliverables:

  • POC 1: KeyValue pattern with MemStore (2 weeks) → RFC-018
  • POC 2: KeyValue pattern with Redis (2 weeks)
  • POC 3: PubSub pattern with NATS (2 weeks)
  • POC 4: Multicast Registry pattern (3 weeks)
  • POC 5: Authentication (Admin Protocol with OIDC) (2 weeks)

Success Criteria:

  • All POCs demonstrate end-to-end flow (client → proxy → backend)
  • Performance targets met (p99 <10ms, 100K RPS)
  • Tests pass against multiple backends (MemStore, Redis, NATS)

Status: POCs 1-3 completed; POCs 4-5 in progress


Phase 1: Alpha Release (Q2 2026)

Goal: Internal dogfooding with friendly teams

Scope:

  • ✅ Layer 1 Primitives: KeyValue, PubSub, Queue
  • ✅ Backends: MemStore, Redis, Postgres, NATS, Kafka
  • ✅ Admin API: Namespace CRUD, health checks, metrics
  • ✅ Client SDKs: Python, Go, Rust
  • ❌ Layer 2 Patterns: Not included (only primitives)
  • ❌ Migrations: Not supported yet

Target Users: 5-10 internal early adopter teams

Success Criteria:

  • 10 namespaces in production
  • 10K+ RPS sustained
  • Zero critical bugs for 2 consecutive weeks
  • Developer feedback: "would recommend" >80%

Risk Mitigation:

  • Feature flags for gradual rollout
  • Shadow traffic only (no primary traffic yet)
  • 24/7 on-call support for early adopters

Phase 2: Beta Release (Q3 2026)

Goal: Production-ready for core use cases

Scope:

  • ✅ All Phase 1 features
  • ✅ Layer 2 Patterns: Multicast Registry, Cache Aside
  • ✅ Migrations: Dual-write pattern
  • ✅ Observability: Full OpenTelemetry integration
  • ✅ Security: OIDC authentication, RBAC authorization
  • ❌ Advanced patterns (Saga, Event Sourcing): Not yet

Target Users: 30% of internal services (~50 services)

Success Criteria:

  • 100+ namespaces in production
  • 500K+ RPS sustained
  • 99.9% availability (month-over-month)
  • 1 successful migration (Redis → DynamoDB)

Marketing:

  • Internal tech talks (bi-weekly)
  • Comprehensive documentation site
  • Getting started guides and templates
  • Office hours (weekly)

Phase 3: GA Release (Q4 2026)

Goal: General availability for all internal teams

Scope:

  • ✅ All Phase 2 features
  • ✅ Layer 2 Patterns: Saga, Event Sourcing, Work Queue
  • ✅ Backends: All planned backends (8 total)
  • ✅ Advanced migrations: Shadow traffic, percentage cutover
  • ✅ Self-service: Namespace creation via GitOps

Target Users: 80% of internal services (~200 services)

Success Criteria:

  • 500+ namespaces in production
  • 5M+ RPS sustained
  • 99.99% availability (quarterly)
  • 3 successful migrations

Support:

  • SLA-backed support (8x5 initially, 24x7 by Q1 2027)
  • Dedicated Slack channel
  • Runbook for common issues
  • Incident response plan

Phase 4: Ecosystem Growth (2027+)

Goal: Become the default data access layer

Scope:

  • ✅ External backends: AWS (DynamoDB, S3, SQS), GCP (Datastore, Pub/Sub)
  • ✅ Community patterns: 3rd-party contributed patterns
  • ✅ Client SDKs: Java, TypeScript, C#
  • ✅ Integrations: Kubernetes Operator, Terraform Provider, Helm Charts

Target Users: 100% of internal services + select external partners

Success Criteria:

  • 1,000+ namespaces
  • 10M+ RPS sustained
  • 99.99% availability (SLA-backed)
  • 5+ community-contributed backend plugins
  • 10+ community-contributed patterns

Ecosystem:

  • Open-source core proxy
  • Plugin marketplace
  • Pattern certification program
  • Annual user conference

Risks and Mitigations

Risk 1: Adoption Resistance (High)

Risk: Teams prefer using backends directly (fear of abstraction overhead)

Mitigation:

  • Prove performance: Publish benchmarks showing <1ms overhead
  • Early wins: Work with friendly teams, showcase success stories
  • Incremental adoption: Allow hybrid (some namespaces via Prism, some direct)
  • Developer experience: Make Prism easier than direct integration (generators, templates)

Ownership: Product Manager + Developer Relations


Risk 2: Performance Bottleneck (Medium)

Risk: Proxy becomes bottleneck at scale (CPU, memory, network)

Mitigation:

  • Rust performance: Leverage zero-cost abstractions, async runtime
  • Benchmarking: Continuous performance regression testing
  • Horizontal scaling: Stateless proxy, easy to scale out
  • Bypass mode: Critical paths can bypass proxy if needed

Ownership: Performance Engineer + SRE


Risk 3: Backend-Specific Features (Medium)

Risk: Teams need backend-specific features not abstracted by Prism

Mitigation:

  • Layer 1 escape hatch: Low-level primitives allow direct control
  • Backend-specific extensions: Optional proto extensions per backend
  • Passthrough mode: Raw query mode for specialized cases
  • Feedback loop: Prioritize frequently requested features

Ownership: Platform Engineer + Product Manager


Risk 4: Migration Complexity (High)

Risk: Dual-write and shadow traffic patterns introduce data consistency issues

Mitigation:

  • Consistency validation: Automated diff detection and alerting
  • Rollback plan: Instant rollback on error rate spike
  • Gradual rollout: Percentage cutover (1% → 10% → 50% → 100%)
  • Dry-run mode: Test migration without impacting production

Ownership: SRE + Database Engineer


Risk 5: Operational Complexity (Medium)

Risk: Prism adds another component to debug, increasing operational burden

Mitigation:

  • Centralized observability: All signals (metrics, traces, logs) in one place
  • Health checks: Automated detection and remediation
  • Runbooks: Comprehensive troubleshooting guides
  • Self-healing: Automatic restarts, circuit breaking, load shedding

Ownership: SRE + DevOps


Open Questions

Question 1: Should Layer 2 Patterns Be Open-Sourced?

Context: Layer 1 (primitives) are generic and reusable. Layer 2 (patterns) may encode internal business logic.

Options:

  • Option A: Open-source all patterns (maximum community value)
  • Option B: Open-source generic patterns only (Multicast Registry, Saga), keep business-specific private
  • Option C: All patterns internal initially, evaluate open-source later

Recommendation: Option B (selective open-source) - generic patterns have broad applicability, business-specific stay internal

Decision Needed By: Q2 2026 (before Beta release)


Question 2: What is the Pricing Model (If External)?

Context: If Prism is offered as managed service to external customers, what pricing model makes sense?

Options:

  • Option A: RPS-based (per million requests)
  • Option B: Namespace-based (per active namespace)
  • Option C: Resource-based (CPU/memory allocation)
  • Option D: Free tier + enterprise support

Recommendation: Start with internal-only (no pricing), evaluate external offering in 2027

Decision Needed By: Q4 2026 (if external offering considered)


Question 3: How Do We Handle Schema Evolution?

Context: Protobuf schemas will evolve (new fields, deprecated methods). How do we maintain compatibility?

Options:

  • Option A: Strict versioning (v1, v2 incompatible)
  • Option B: Backward-compatible only (always additive)
  • Option C: API versioning per namespace (clients pin versions)

Recommendation: Option B + C hybrid (backward-compatible by default, namespaces can pin versions)

Decision Needed By: Q1 2026 (before Alpha)


Appendix

Competitive Landscape

| Product | Approach | Strengths | Weaknesses | Differentiation |
|---|---|---|---|---|
| Netflix Data Gateway | JVM-based proxy | Battle-tested at scale | Proprietary, JVM overhead | Rust performance, local-first testing |
| AWS AppSync | Managed GraphQL | Serverless, fully managed | AWS-only, GraphQL-specific | Multi-cloud, gRPC/HTTP APIs |
| Hasura | GraphQL over Postgres | Instant GraphQL API | Postgres-only initially | Multi-backend, pattern library |
| Kong / Envoy | API Gateway | HTTP/gRPC proxy | No data abstraction | Data-aware patterns (not just routing) |
| Direct SDK | Client libraries | No additional hop | Tight coupling, hard to migrate | Loose coupling, easy migrations |

Prism's Unique Value:

  1. Performance: Rust-based, 10-100x better than JVM alternatives
  2. Flexibility: Works with any backend (not locked to AWS/Postgres)
  3. Patterns: High-level abstractions (not just API gateway)
  4. Local-First: Full stack runs on laptop (not just cloud)

References

Netflix Data Gateway:

Prism Architecture:

Design Philosophy:


Revision History

  • 2025-10-12: Initial PRD based on Netflix learnings and Prism architecture memos
  • Future: Updates as product evolves

Approvals

Product Owner: [Name] - Approved [Date]

Engineering Lead: [Name] - Approved [Date]

Architecture Review: [Name] - Approved [Date]