PRD-001: Prism Data Access Gateway
Executive Summary
Prism is a high-performance data access gateway that provides unified APIs for heterogeneous backend datastores, enabling application developers to focus on business logic while the platform handles data access complexity, migrations, and operational concerns.
Inspired by Netflix's Data Gateway, Prism adopts proven patterns from Netflix's 8M+ QPS, 3,500+ use-case platform while improving performance (10-100x via Rust), developer experience (client-originated configuration), and operational simplicity (local-first testing, flexible deployment).
Target Launch: Q2 2026 (Phase 0 POCs complete by Q1 2026)
Success Metric: 80% of internal microservices use Prism for data access within 12 months of GA
Product Vision
The Problem: Data Access Complexity at Scale
Modern microservices architectures face growing data access challenges:
- API Fragmentation: Each datastore (Redis, Postgres, Kafka, DynamoDB) has unique APIs, client libraries, and operational requirements
- Migration Complexity: Changing backends requires rewriting application code, extensive testing, and risky deployments
- Distributed Systems Knowledge Gap: Most application developers shouldn't need expertise in consistency models, partitioning, replication, and distributed transactions
- Operational Burden: Each backend requires separate monitoring, capacity planning, security configuration, and disaster recovery
- Pattern Reimplementation: Common patterns (outbox, claim check, sagas) are reimplemented inconsistently across teams
The Solution: Unified Data Access Layer
Prism provides abstraction without compromise:
- Unified APIs: Single set of gRPC/HTTP APIs for KeyValue, PubSub, Queue, TimeSeries, Graph, and Document access patterns
- Backend Agnostic: Application code unchanged when switching from Redis to DynamoDB, or Kafka to NATS
- Semantic Guarantees: Patterns like Multicast Registry coordinate multiple backends atomically
- High Performance: Rust-based proxy achieves sub-millisecond p99 latency even at 100K+ RPS
- Zero-Downtime Migrations: Shadow traffic and dual-write patterns enable gradual backend changes
- Operational Simplicity: Centralized monitoring, security, and capacity management
Strategic Goals
- Accelerate Development: Reduce time-to-production for new services by 50% (eliminate backend integration work)
- Enable Migrations: Support 3+ major backend migrations per year with zero application code changes
- Reduce Operational Cost: Consolidate backend expertise, reduce redundant tooling, optimize resource utilization
- Improve Reliability: Provide battle-tested patterns, circuit breaking, load shedding, and failover built-in
- Foster Innovation: Allow teams to experiment with new backends without rewriting applications
Market Context
Netflix Data Gateway Learnings
Netflix's Data Gateway serves as our primary inspiration:
Scale Achievements:
- 8M+ queries per second (key-value abstraction)
- 10M+ writes per second (time-series data)
- 3,500+ use cases across the organization
- Petabyte-scale storage with low-latency retrieval
Key Lessons Adopted (Netflix Index):
Netflix Lesson | Prism Implementation |
---|---|
Abstraction Simplifies Scale | Layer 1: Primitives (KeyValue, PubSub, Queue, TimeSeries, Graph, Document) |
Prioritize Reliability | Circuit breaking, load shedding, failover built-in (ADR-029) |
Data Management Critical | TTL, lifecycle policies, tiering strategies first-class (RFC-014) |
Sharding for Isolation | Namespace-based isolation, per-tenant deployments (ADR-034) |
Zero-Downtime Migrations | Shadow traffic, dual-write patterns, phased cutover (ADR-031) |
Prism's Improvements Over Netflix
Aspect | Netflix Approach | Prism Enhancement | Benefit |
---|---|---|---|
Proxy Layer | JVM-based gateway | Rust-based (Tokio + Tonic) | 10-100x performance, lower resource usage |
Configuration | Runtime deployment configs | Client-originated (apps declare needs) | Self-service, reduced ops toil |
Testing | Production-validated | Local-first (sqlite, testcontainers) | Fast feedback, deterministic tests |
Deployment | Kubernetes-native | Flexible (bare metal, VMs, containers) | Simpler operations, lower cost |
Documentation | Internal wiki | Documentation-first (ADRs, RFCs, micro-CMS) | Faster onboarding, knowledge preservation |
User Personas
Primary: Application Developer (Backend Engineer)
Profile: Mid-level engineer building microservices, 2-5 years experience, proficient in one language (Go, Python, Rust, Java)
Goals:
- Build features quickly without learning distributed systems internals
- Use familiar patterns (REST APIs, pub/sub messaging) without backend-specific knowledge
- Deploy code confidently without breaking production
- Understand system behavior when things go wrong
Pain Points:
- Overwhelmed by backend options (Redis, Postgres, Kafka, DynamoDB, Cassandra...)
- Spending weeks integrating with each new datastore
- Fear of making wrong architectural decisions early
- Debugging distributed systems issues without proper training
Prism Value:
- ✅ Single API for all data access (learn once, use everywhere)
- ✅ Start with MemStore (in-memory), migrate to Redis/Postgres later without code changes
- ✅ Pattern library provides proven solutions (Multicast Registry, Saga, Event Sourcing)
- ✅ Rich error messages and observability built-in
Success Metric: Time from "new service" to "production" < 1 week
Secondary: Platform Engineer (Infrastructure Team)
Profile: Senior engineer responsible for platform services, 5-10 years experience, deep distributed systems knowledge
Goals:
- Provide self-service capabilities to application teams
- Maintain platform stability (high availability, low latency)
- Manage cost and capacity efficiently
- Enable safe migrations and experiments
Pain Points:
- Supporting N different datastores with N different operational models
- Manual work for every new namespace or backend instance
- Risk of cascading failures from misbehaving applications
- Difficult to enforce best practices (circuit breaking, retries, timeouts)
Prism Value:
- ✅ Centralized observability and operational controls
- ✅ Policy enforcement (rate limiting, access control, data governance)
- ✅ Self-service namespace creation via declarative config
- ✅ Backend substitutability (migrate Redis → DynamoDB transparently)
Success Metric: Operational incidents reduced by 50%, MTTR < 15 minutes
Tertiary: Data Engineer / Analyst
Profile: Specialist working with analytics, ML pipelines, or data warehousing
Goals:
- Access production data for analytics safely
- Build ETL pipelines without impacting production services
- Integrate with existing analytics tools (Spark, Airflow, Snowflake)
Pain Points:
- Direct database access risks impacting production
- Inconsistent data formats across microservices
- Difficult to maintain data lineage and quality
Prism Value:
- ✅ Read-only replicas and Change Data Capture (CDC) support
- ✅ TimeSeries abstraction for metrics and event logs
- ✅ Graph abstraction for relationship queries
- ✅ Audit trails and data provenance built-in
Success Metric: Analytics queries don't impact production latency
Core Features
Feature 1: Layered API Architecture
Layer 1: Primitives (Always Available)
Six foundational abstractions that compose to solve 80% of use cases:
Primitive | Purpose | Backend Examples | RFC Reference |
---|---|---|---|
KeyValue | Simple storage | Redis, DynamoDB, etcd, Postgres, MemStore | RFC-014 |
PubSub | Fire-and-forget messaging | NATS, Redis, Kafka (as topic) | RFC-014 |
Queue | Work distribution | SQS, Postgres, RabbitMQ | RFC-014 |
Stream | Ordered event log | Kafka, NATS JetStream, Redis Streams | RFC-014 |
TimeSeries | Temporal data | ClickHouse, TimescaleDB, Prometheus | RFC-014 |
Graph | Relationships | Neptune, Neo4j, Postgres (recursive CTEs) | RFC-014 |
Example Usage (Primitives):
from prism import Client
client = Client(endpoint="localhost:8080")
# KeyValue: Simple storage
await client.keyvalue.set("user:123", user_data, ttl=3600)
user = await client.keyvalue.get("user:123")
# PubSub: Broadcast events
await client.pubsub.publish("user-events", event_data)
async for event in client.pubsub.subscribe("user-events"):
    process(event)
# Queue: Background jobs
await client.queue.enqueue("email-jobs", email_task)
task = await client.queue.dequeue("email-jobs", visibility_timeout=30)
Layer 2: Patterns (Use-Case-Specific, Opt-In)
Purpose-built patterns that coordinate multiple backends for common use cases:
Pattern | Solves | Composes | RFC Reference |
---|---|---|---|
Multicast Registry | Device management, presence, service discovery | KeyValue + PubSub + Queue | RFC-017 |
Saga | Distributed transactions | KeyValue + Queue + Compensation | Planned Q2 2026 |
Event Sourcing | Audit trails, event replay | Stream + KeyValue + Snapshots | Planned Q2 2026 |
Cache Aside | Read-through caching | KeyValue (cache) + KeyValue (db) | Planned Q3 2026 |
Outbox | Transactional messaging | KeyValue (tx) + Queue + WAL | Planned Q3 2026 |
Example Usage (Patterns):
# Multicast Registry: IoT device management
registry = client.multicast_registry("iot-devices")
# Register device with metadata
await registry.register(
    identity="device-sensor-001",
    metadata={"type": "temperature", "location": "building-a", "floor": 3}
)
# Enumerate matching devices
devices = await registry.enumerate(filter={"location": "building-a"})
# Multicast command to filtered subset
result = await registry.multicast(
    filter={"type": "temperature", "floor": 3},
    message={"command": "read", "sample_rate": 5}
)
print(f"Delivered to {result.success_count}/{result.total_count} devices")
Why Layered? (MEMO-005)
- Layer 1 for power users who need full control and novel compositions
- Layer 2 for most developers who want ergonomic, self-documenting APIs
- Choice based on team expertise and use case requirements
User Persona Mapping:
- Application Developers: Primarily Layer 2 (80% of use cases)
- Platform Engineers: Both layers (Layer 1 for infrastructure, Layer 2 for application teams)
- Advanced Users: Layer 1 for custom patterns
Feature 2: Backend Plugin Architecture
Goal: Support 10+ backends without bloating the core proxy
Architecture (RFC-008):
┌──────────────────────────────────────────────────────┐
│ Prism Proxy (Rust Core) │
│ ┌────────────────────────────────────────────────┐ │
│ │ Layer 1 API: KeyValue, PubSub, Queue, etc. │ │
│ └────────────────────────────────────────────────┘ │
│ │ │
│ ↓ gRPC │
│ ┌────────────────────────────────────────────────┐ │
│ │ Namespace Router (config-driven) │ │
│ └────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────┘
│
┌────────────┼────────────┐
↓ ↓ ↓
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Redis │ │ Postgres│ │ Kafka │
│ Plugin │ │ Plugin │ │ Plugin │
│ (Go) │ │ (Go) │ │ (Go) │
└─────────┘ └─────────┘ └─────────┘
Backend Interface Decomposition (MEMO-006):
Instead of exposing a monolithic "Redis backend", each backend advertises a set of thin interfaces:
# Redis implements 24 interfaces
backend: redis
implements:
- keyvalue_basic # Set, Get, Delete, Exists
- keyvalue_scan # Scan, ScanKeys, Count
- keyvalue_ttl # Expire, GetTTL, Persist
- keyvalue_transactional # MULTI/EXEC
- keyvalue_batch # MGET, MSET
- pubsub_basic # Publish, Subscribe
- pubsub_wildcards # Pattern matching
- stream_basic # XADD, XREAD
- stream_consumer_groups # XGROUP, XREADGROUP
- stream_replay # XREAD from offset
# ...and 14 more (Lists, Sets, SortedSets)
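To make the decomposition concrete, a sketch of how these thin interfaces might look on the Go plugin side is shown below. It is illustrative only: the interface and method names (KeyValueBasic, KeyValueTTL, Capabilities) mirror the YAML identifiers above but are assumptions, not the actual MEMO-006 definitions.

```go
package plugin

import (
    "context"
    "time"
)

// KeyValueBasic is the smallest capability a backend can advertise (keyvalue_basic).
type KeyValueBasic interface {
    Set(ctx context.Context, key string, value []byte) error
    Get(ctx context.Context, key string) ([]byte, error)
    Delete(ctx context.Context, key string) error
    Exists(ctx context.Context, key string) (bool, error)
}

// KeyValueTTL is an optional capability layered on top of KeyValueBasic (keyvalue_ttl).
type KeyValueTTL interface {
    Expire(ctx context.Context, key string, ttl time.Duration) error
    GetTTL(ctx context.Context, key string) (time.Duration, error)
    Persist(ctx context.Context, key string) error
}

// Capabilities lets the proxy discover at registration time which interfaces a
// plugin implements, matching the identifiers listed in the YAML above.
type Capabilities interface {
    Implements() []string // e.g. ["keyvalue_basic", "keyvalue_ttl", "pubsub_basic"]
}
```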
Pattern Slot Matching:
Patterns declare required interfaces for each slot, proxy validates at config time:
pattern: multicast-registry
slots:
  registry:
    required: [keyvalue_basic, keyvalue_scan]
    optional: [keyvalue_ttl]
    recommended: [redis, postgres, dynamodb, etcd]
  messaging:
    required: [pubsub_basic]
    optional: [pubsub_persistent]
    recommended: [nats, kafka, redis]
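A minimal sketch of that config-time check, assuming hypothetical PatternSlot and ValidateSlot names; the real proxy would run this validation when a namespace configuration is loaded:

```go
package config

import "fmt"

// PatternSlot mirrors the YAML above: each slot names the interfaces it needs.
type PatternSlot struct {
    Name     string
    Required []string
    Optional []string
}

// ValidateSlot fails fast at config time if the chosen backend does not advertise
// every required interface; optional interfaces merely unlock extra features.
func ValidateSlot(slot PatternSlot, backendName string, implemented []string) error {
    have := make(map[string]bool, len(implemented))
    for _, iface := range implemented {
        have[iface] = true
    }
    for _, req := range slot.Required {
        if !have[req] {
            return fmt.Errorf("slot %q: backend %q is missing required interface %q",
                slot.Name, backendName, req)
        }
    }
    return nil
}
```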
Backend Priority (MEMO-004):
Phase 1 (Internal Priorities):
0. MemStore (In-memory Go map) - Score: 100/100 - Zero dependencies, instant startup
1. Kafka - Score: 78/100 - Internal event streaming
2. NATS - Score: 90/100 - Internal pub/sub messaging
3. PostgreSQL - Score: 93/100 - Internal relational data
4. Neptune - Score: 50/100 - Internal graph data
Phase 2 (External/Supporting):
5. Redis - Score: 95/100 - General caching
6. SQLite - Score: 92/100 - Embedded testing
7. S3/MinIO - Score: 85/100 - Large payload handling
8. ClickHouse - Score: 70/100 - Analytics
Feature 3: Client-Originated Configuration
Problem: Traditional approaches require ops teams to provision infrastructure before developers can code.
Prism Approach: Application declares requirements, platform provisions automatically.
Configuration Format:
# Application: prism.yaml (committed to app repo)
namespaces:
  - name: user-sessions
    pattern: keyvalue
    needs:
      latency: p99 < 10ms
      throughput: 50K rps
      ttl: required
      persistence: optional  # Can survive restarts
    backend:
      type: redis  # Explicit choice
      # OR
      auto: true   # Platform selects best match

  - name: notification-queue
    pattern: queue
    needs:
      visibility_timeout: 30s
      dead_letter: true
      throughput: 10K enqueues/sec
    backend:
      type: postgres  # Using Postgres as queue (SKIP LOCKED pattern)
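For illustration, a dequeue against the Postgres-backed queue could use SELECT ... FOR UPDATE SKIP LOCKED roughly as sketched below. The table and column names (queue_jobs, visible_at) and the 30-second visibility window are assumptions chosen to match the config above, not Prism's actual schema.

```go
package pgqueue

import (
    "context"
    "database/sql"
)

// DequeueOne claims one ready job with FOR UPDATE SKIP LOCKED, so concurrent
// consumers never block on or double-claim the same row.
func DequeueOne(ctx context.Context, db *sql.DB) (id int64, payload []byte, err error) {
    tx, err := db.BeginTx(ctx, nil)
    if err != nil {
        return 0, nil, err
    }
    defer tx.Rollback() // returns ErrTxDone after a successful Commit; safe to ignore

    row := tx.QueryRowContext(ctx, `
        SELECT id, payload
        FROM queue_jobs
        WHERE visible_at <= now()
        ORDER BY id
        LIMIT 1
        FOR UPDATE SKIP LOCKED`)
    if err := row.Scan(&id, &payload); err != nil {
        return 0, nil, err // sql.ErrNoRows means the queue is currently empty
    }

    // Hide the row for the visibility timeout instead of deleting it, so a crashed
    // consumer's job reappears and is retried (dead-lettering omitted here).
    if _, err := tx.ExecContext(ctx,
        `UPDATE queue_jobs SET visible_at = now() + interval '30 seconds' WHERE id = $1`, id); err != nil {
        return 0, nil, err
    }
    return id, payload, tx.Commit()
}
```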
Platform Workflow:
- Deploy: Application pushes config to Prism control plane
- Validate: Proxy validates requirements are satisfiable
- Provision: Backends auto-provisioned (or mapped to existing)
- Observe: Namespace metrics tracked, capacity adjusted automatically
Benefits:
- ✅ Self-service (no ops ticket required)
- ✅ Version controlled (infrastructure as code in app repo)
- ✅ Testable (use MemStore in dev, Redis in production)
- ✅ Evolvable (add new `needs` fields without breaking changes)
Feature 4: Local-First Testing Strategy
Goal: Developers run full Prism stack on laptop with zero cloud dependencies.
Architecture (ADR-004):
# Development workflow
make dev-up # Start Prism proxy + MemStore (in-process, instant)
make test # Run tests against local MemStore
make dev-down # Stop everything
# Integration testing
make integration-up # Start testcontainers (Redis, Postgres, NATS, Kafka)
make integration-test # Run full test suite against real backends
make integration-down # Cleanup
Test Pyramid:
┌────────────────┐
│ E2E Tests │ Kubernetes, full stack
│ (10 tests) │ Runtime: 5 minutes
└────────────────┘
┌──────────────────┐
│ Integration Tests│ Testcontainers (Redis, Postgres)
│ (100 tests) │ Runtime: 2 minutes
└──────────────────┘
┌──────────────────────┐
│ Unit Tests │ MemStore (in-memory, no containers)
│ (1000 tests) │ Runtime: 10 seconds
└──────────────────────┘
Backend Substitutability:
Same test suite runs against multiple backends:
// Interface-based acceptance tests
backendDrivers := []BackendSetup{
    {Name: "MemStore", Setup: setupMemStore, SupportsTTL: true},
    {Name: "Redis", Setup: setupRedis, SupportsTTL: true},
    {Name: "Postgres", Setup: setupPostgres, SupportsTTL: false},
}

for _, backend := range backendDrivers {
    t.Run(backend.Name, func(t *testing.T) {
        driver, cleanup := backend.Setup(t)
        defer cleanup()

        // Same test code for all backends
        testKeyValueBasicOperations(t, driver)
        if backend.SupportsTTL {
            testKeyValueTTL(t, driver)
        }
    })
}
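The setupRedis helper referenced above could be backed by testcontainers-go. The sketch below is simplified (it returns a connection address and cleanup function rather than a Prism driver), and the image tag is an assumption:

```go
package acceptance

import (
    "context"
    "testing"

    "github.com/testcontainers/testcontainers-go"
    "github.com/testcontainers/testcontainers-go/wait"
)

// setupRedisContainer starts a throwaway Redis for integration tests and returns
// its host:port plus a cleanup function for use with defer.
func setupRedisContainer(t *testing.T) (addr string, cleanup func()) {
    ctx := context.Background()
    container, err := testcontainers.GenericContainer(ctx, testcontainers.GenericContainerRequest{
        ContainerRequest: testcontainers.ContainerRequest{
            Image:        "redis:7-alpine",
            ExposedPorts: []string{"6379/tcp"},
            WaitingFor:   wait.ForListeningPort("6379/tcp"),
        },
        Started: true,
    })
    if err != nil {
        t.Fatalf("start redis container: %v", err)
    }
    host, err := container.Host(ctx)
    if err != nil {
        t.Fatalf("resolve container host: %v", err)
    }
    port, err := container.MappedPort(ctx, "6379/tcp")
    if err != nil {
        t.Fatalf("resolve mapped port: %v", err)
    }
    return host + ":" + port.Port(), func() { _ = container.Terminate(ctx) }
}
```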
Developer Experience:
- ✅ Unit tests run in <10 seconds (MemStore is instant)
- ✅ Integration tests run in <2 minutes (testcontainers)
- ✅ CI/CD fails fast (no waiting for cloud resources)
- ✅ Deterministic (no flaky tests from network/cloud issues)
Feature 5: Zero-Downtime Migrations
Goal: Change backends without application code changes or service interruptions.
Migration Patterns (ADR-031):
Pattern 1: Dual-Write (Postgres → DynamoDB example)
# Phase 1: Dual-write to both backends
namespace: user-profiles
migration:
  strategy: dual-write
  primary: postgres   # Reads from here
  shadow: dynamodb    # Writes to both, reads for comparison

# Phase 2: Switch primary (traffic cutover)
namespace: user-profiles
migration:
  strategy: dual-write
  primary: dynamodb   # Reads from here
  shadow: postgres    # Still writing to both

# Phase 3: Complete migration (remove shadow)
namespace: user-profiles
backend: dynamodb
Observability During Migration:
- Consistency diff percentage (shadow reads vs primary reads)
- Latency comparison (primary vs shadow)
- Error rates per backend
- Data completeness metrics
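One way the consistency diff could be computed without adding latency to the read path is sketched below; the Reader interface and DiffStats type are hypothetical, not Prism's actual internals:

```go
package dualwrite

import (
    "bytes"
    "context"
    "sync/atomic"
)

// Reader is the minimal read capability both backends are assumed to expose.
type Reader interface {
    Get(ctx context.Context, key string) ([]byte, error)
}

// DiffStats feeds the "consistency diff percentage" signal during a migration.
type DiffStats struct {
    compared atomic.Int64
    mismatch atomic.Int64
}

func (s *DiffStats) MismatchPercent() float64 {
    c := s.compared.Load()
    if c == 0 {
        return 0
    }
    return 100 * float64(s.mismatch.Load()) / float64(c)
}

// CompareRead serves the value from the primary, then compares it against the
// shadow backend asynchronously so clients never pay for the shadow read.
func CompareRead(ctx context.Context, primary, shadow Reader, key string, stats *DiffStats) ([]byte, error) {
    value, err := primary.Get(ctx, key)
    if err != nil {
        return nil, err
    }
    go func() {
        shadowValue, shadowErr := shadow.Get(context.Background(), key)
        stats.compared.Add(1)
        if shadowErr != nil || !bytes.Equal(value, shadowValue) {
            stats.mismatch.Add(1)
        }
    }()
    return value, nil
}
```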
Pattern 2: Shadow Traffic (Kafka → NATS example)
# Phase 1: Shadow traffic to new backend
namespace: events
migration:
  strategy: shadow
  primary: kafka   # All production traffic
  shadow: nats     # Copy of traffic (metrics only)
# Observe: Validate NATS can handle load, latency acceptable

# Phase 2: Percentage cutover
namespace: events
migration:
  strategy: percentage
  backends:
    - nats: 10%    # 10% of traffic
    - kafka: 90%

# Phase 3: Full cutover
namespace: events
backend: nats
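The percentage strategy reduces to weighted backend selection per request. A minimal sketch, assuming a hypothetical WeightedBackend config type and at least one configured backend:

```go
package router

import "math/rand"

// WeightedBackend pairs a backend with its share of traffic, mirroring the
// percentage strategy above (e.g. nats: 10, kafka: 90).
type WeightedBackend struct {
    Name   string
    Weight int // percentage; weights are expected to sum to 100
}

// Pick selects a backend for one request in proportion to the configured weights,
// so cutover can be raised gradually (1% -> 10% -> 50% -> 100%).
func Pick(backends []WeightedBackend) string {
    total := 0
    for _, b := range backends {
        total += b.Weight
    }
    if total <= 0 {
        return backends[len(backends)-1].Name // degenerate config: fall back to last entry
    }
    n := rand.Intn(total)
    for _, b := range backends {
        n -= b.Weight
        if n < 0 {
            return b.Name
        }
    }
    return backends[len(backends)-1].Name
}
```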
Safety Guarantees:
- ✅ Automatic rollback on error rate spike
- ✅ Circuit breaker prevents cascading failures
- ✅ Data consistency validation before full cutover
- ✅ Application code unchanged throughout migration
Feature 6: Documentation-First Development
Goal: Design before implementation, preserve decisions permanently.
Workflow (MEMO-003):
┌──────────────────────────────────────────────────────┐
│ 1. Design Phase: Write RFC/ADR with diagrams │
│ - Mermaid sequence diagrams for flows │
│ - Code examples that compile │
│ - Trade-offs explicitly documented │
│ Duration: 1-2 days │
└──────────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────────┐
│ 2. Review Phase: Team feedback on design │
│ - Async review via GitHub PR │
│ - Live preview with Docusaurus (instant feedback) │
│ - Iterate on design (not code) │
│ Duration: 2-3 days │
└──────────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────────┐
│ 3. Implementation Phase: Code follows design │
│ - RFC is the spec (not implementation detail) │
│ - Tests match documented examples │
│ - Zero design rework │
│ Duration: 5-7 days │
└──────────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────────┐
│ 4. Validation Phase: Verify code matches docs │
│ - Link PRs to RFCs │
│ - Update docs if implementation diverged │
│ - Maintain living documentation │
└──────────────────────────────────────────────────────┘
Documentation Types:
Type | Purpose | Example |
---|---|---|
ADR (Architecture Decision Record) | Why we made significant architectural choices | ADR-001: Rust for Proxy |
RFC (Request for Comments) | Complete technical specification for features | RFC-010: Admin Protocol |
MEMO | Analysis, reviews, process improvements | MEMO-004: Backend Implementation Guide |
Micro-CMS Advantage:
Prism uses Docusaurus + GitHub Pages as a "micro-CMS":
- ✅ Rendered Mermaid diagrams (understand flows instantly)
- ✅ Syntax-highlighted code examples (copy-paste ready)
- ✅ Full-text search (find answers in seconds)
- ✅ Cross-referenced knowledge graph (ADRs ↔ RFCs ↔ MEMOs)
- ✅ Live preview (see changes in <1 second)
- ✅ Professional appearance (builds trust with stakeholders)
Impact:
- Design flaws caught before implementation (cost: 1 hour to fix RFC vs 1 week to refactor code)
- New team members productive in <1 week (read docs, not code)
- Decisions preserved permanently (no tribal knowledge loss)
Technical Requirements
Performance Requirements
Metric | Target | Measurement Method |
---|---|---|
Latency (p50) | <1ms | End-to-end client → proxy → backend → client |
Latency (p99) | <10ms | Excludes backend latency (measure proxy overhead) |
Latency (p99.9) | <50ms | With load shedding and circuit breaking active |
Throughput | 100K+ RPS | Single proxy instance on 4-core VM |
Concurrency | 10K+ connections | Simultaneous client connections per proxy |
Memory | <500MB baseline | Proxy memory usage at idle |
CPU | <30% at 50K RPS | Proxy CPU usage under load |
Rationale: Netflix's Java-based gateway achieves 8M+ QPS across its cluster. Prism targets 100K+ RPS per proxy instance (an estimated 10-100x per-instance efficiency gain) via Rust's zero-cost abstractions and the Tokio async runtime.
Reliability Requirements
Requirement | Target | Implementation |
---|---|---|
Availability | 99.99% (52 min downtime/year) | Multi-region deployment, health checks, auto-restart |
Circuit Breaking | Trip after 5 consecutive failures | Per-backend circuit breaker, 30s recovery window |
Load Shedding | Shed requests at 90% capacity | Priority-based queuing, graceful degradation |
Failover | <5s to switch to replica | Automatic health-check-based failover |
Data Durability | Zero message loss (Queue pattern) | At-least-once delivery, persistent queue backends |
Consistency | Configurable (eventual → strong) | Per-namespace consistency level declaration |
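For illustration, the per-backend circuit breaker could take the shape below (a consecutive-failure threshold plus a fixed recovery window, with half-open probing omitted). Type and field names are assumptions, not the proxy's actual implementation:

```go
package resilience

import (
    "errors"
    "sync"
    "time"
)

var ErrCircuitOpen = errors.New("circuit open: backend temporarily bypassed")

// Breaker trips after Threshold consecutive failures and rejects calls until the
// Recovery window elapses, matching the defaults above (5 failures, 30s).
type Breaker struct {
    mu        sync.Mutex
    failures  int
    openUntil time.Time

    Threshold int           // e.g. 5
    Recovery  time.Duration // e.g. 30 * time.Second
}

// Do runs fn unless the breaker is open, and records the outcome.
func (b *Breaker) Do(fn func() error) error {
    b.mu.Lock()
    if time.Now().Before(b.openUntil) {
        b.mu.Unlock()
        return ErrCircuitOpen
    }
    b.mu.Unlock()

    err := fn()

    b.mu.Lock()
    defer b.mu.Unlock()
    if err != nil {
        b.failures++
        if b.failures >= b.Threshold {
            b.openUntil = time.Now().Add(b.Recovery)
            b.failures = 0
        }
        return err
    }
    b.failures = 0
    return nil
}
```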
Security Requirements
Requirement | Implementation | Reference |
---|---|---|
Authentication | OIDC (OAuth2) for admin API, mTLS for data plane | RFC-010, ADR-046 |
Authorization | Namespace-based RBAC, OPA integration | RFC-011 |
Encryption | TLS 1.3 for all communication | ADR-047 |
Audit Logging | All data access logged with user context | RFC-010 |
PII Handling | Automatic encryption/masking via proto tags | ADR-003 |
Secrets Management | HashiCorp Vault integration | Planned Q2 2026 |
Observability Requirements
Signal | Collection Method | Storage | Retention |
---|---|---|---|
Metrics | OpenTelemetry (Prometheus format) | Local Signoz instance (dev), Prometheus (prod) | 90 days |
Traces | OpenTelemetry (OTLP) | Signoz (dev), Jaeger (prod) | 30 days |
Logs | Structured JSON (slog) | Signoz (dev), Loki (prod) | 14 days |
Profiles | pprof (Go plugins), perf (Rust proxy) | S3 (long-term) | 7 days active |
Key Metrics to Track:
- Request rate (RPS) per namespace
- Latency histogram per namespace per backend
- Error rate per namespace per backend
- Backend health (up/down, latency, capacity)
- Cache hit rate (if caching enabled)
- Migration progress (dual-write consistency %)
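A sketch of how a Go backend plugin might record the per-namespace, per-backend latency histogram using the OpenTelemetry metrics API; the instrument and attribute names are illustrative, not a defined Prism metrics schema:

```go
package telemetry

import (
    "context"
    "time"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/metric"
)

var meter = otel.Meter("prism.proxy")

// Instrument creation error is ignored here for brevity; production code should check it.
var requestDuration, _ = meter.Float64Histogram(
    "prism.request.duration",
    metric.WithUnit("ms"),
    metric.WithDescription("Request latency per namespace and backend"),
)

// RecordRequest tags each observation with namespace, backend, and error status,
// which is enough to derive the rate, latency, and error-rate metrics listed above.
func RecordRequest(ctx context.Context, namespace, backend string, start time.Time, err error) {
    elapsedMs := float64(time.Since(start).Microseconds()) / 1000.0
    requestDuration.Record(ctx, elapsedMs,
        metric.WithAttributes(
            attribute.String("namespace", namespace),
            attribute.String("backend", backend),
            attribute.Bool("error", err != nil),
        ))
}
```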
Scalability Requirements
Dimension | Target | Strategy |
---|---|---|
Namespaces | 1,000+ per proxy | Namespace isolation, lightweight routing |
Backends | 100+ unique backend instances | Plugin architecture, lazy loading |
Clients | 10,000+ concurrent clients | Connection pooling, multiplexing |
Message Size | Up to 5GB (via Claim Check) | Automatic large payload handling |
Retention | 30+ days (streams, queues) | Backend-native retention policies |
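The Claim Check row above implies that oversized payloads are parked in object storage while only a small reference travels through the queue. A minimal sketch, with an assumed ObjectStore abstraction standing in for S3/MinIO:

```go
package claimcheck

import (
    "context"

    "github.com/google/uuid"
)

// ObjectStore abstracts the blob backend (e.g. S3/MinIO); these methods are assumptions.
type ObjectStore interface {
    Put(ctx context.Context, key string, data []byte) error
    Get(ctx context.Context, key string) ([]byte, error)
}

// Claim is the small reference enqueued in place of the payload itself.
type Claim struct {
    Key  string `json:"key"`
    Size int    `json:"size"`
}

// Check stores an oversized payload and returns the claim to enqueue; producers
// would typically only do this above some inline size threshold (e.g. 1 MiB).
func Check(ctx context.Context, store ObjectStore, payload []byte) (*Claim, error) {
    key := "claims/" + uuid.NewString()
    if err := store.Put(ctx, key, payload); err != nil {
        return nil, err
    }
    return &Claim{Key: key, Size: len(payload)}, nil
}

// Redeem fetches the payload back on the consumer side using the claim key.
func Redeem(ctx context.Context, store ObjectStore, c *Claim) ([]byte, error) {
    return store.Get(ctx, c.Key)
}
```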
Success Metrics
Product Adoption (Primary Metric)
Goal: 80% of internal microservices use Prism within 12 months of GA
Measurement:
- Number of namespaces created
- Number of unique applications using Prism
- RPS through Prism vs direct backend access
- % of new services that use Prism (target: 100%)
Milestone Targets:
- Month 3: 10 early adopters (friendly teams)
- Month 6: 30% of internal services
- Month 9: 60% of internal services
- Month 12: 80% of internal services
Developer Productivity
Goal: 50% reduction in time-to-production for new services
Metric | Before Prism | With Prism | Measurement |
---|---|---|---|
Time to Production | 4-6 weeks | 1-2 weeks | Repo created → first production request |
Platform Tickets | Baseline | 50% reduction | Monthly support ticket volume |
Developer Satisfaction | N/A | >80% "would recommend" | Quarterly survey |
Operational Efficiency
Goal: Reduce data access incidents by 50%, MTTR <15 minutes
Metric | Baseline | Target (6 months) | Measurement |
---|---|---|---|
Incidents | 20/month | 10/month | Data access-related incidents |
MTTR | Variable | <15 minutes | Mean time to resolution |
On-Call Pages | Baseline | 50% reduction | Backend-related pages |
Migration Velocity
Goal: Enable 3+ backend migrations/year with zero application code changes
Phase | Timeline | Target | Measurement |
---|---|---|---|
Phase 1 | 2026 | 1 migration | Redis → DynamoDB |
Phase 2 | 2027+ | 3+ migrations/year | Successful cutover count |
Code Changes | All phases | Zero | Lines of application code changed |
Performance
Goal: p99 latency <10ms, 100K+ RPS per instance
Measurement:
- Latency histogram (p50, p95, p99, p99.9)
- Throughput (RPS) per instance
- Resource utilization (CPU, memory)
Target:
- p99 latency: <10ms (excluding backend latency)
- Throughput: 100K RPS (4-core VM)
- CPU: <30% at 50K RPS
Release Phases
Phase 0: POC Validation (Q4 2025 - Q1 2026) ✅ In Progress
Goal: Prove core architecture with minimal scope
Deliverables:
- POC 1: KeyValue pattern with MemStore (2 weeks) → RFC-018
- POC 2: KeyValue pattern with Redis (2 weeks)
- POC 3: PubSub pattern with NATS (2 weeks)
- POC 4: Multicast Registry pattern (3 weeks)
- POC 5: Authentication (Admin Protocol with OIDC) (2 weeks)
Success Criteria:
- All POCs demonstrate end-to-end flow (client → proxy → backend)
- Performance targets met (p99 <10ms, 100K RPS)
- Tests pass against multiple backends (MemStore, Redis, NATS)
Status: POCs 1-3 completed; POCs 4 and 5 in progress
Phase 1: Alpha Release (Q2 2026)
Goal: Internal dogfooding with friendly teams
Scope:
- ✅ Layer 1 Primitives: KeyValue, PubSub, Queue
- ✅ Backends: MemStore, Redis, Postgres, NATS, Kafka
- ✅ Admin API: Namespace CRUD, health checks, metrics
- ✅ Client SDKs: Python, Go, Rust
- ❌ Layer 2 Patterns: Not included (only primitives)
- ❌ Migrations: Not supported yet
Target Users: 5-10 internal early adopter teams
Success Criteria:
- 10 namespaces in production
- 10K+ RPS sustained
- Zero critical bugs for 2 consecutive weeks
- Developer feedback: "would recommend" >80%
Risk Mitigation:
- Feature flags for gradual rollout
- Shadow traffic only (no primary traffic yet)
- 24/7 on-call support for early adopters
Phase 2: Beta Release (Q3 2026)
Goal: Production-ready for core use cases
Scope:
- ✅ All Phase 1 features
- ✅ Layer 2 Patterns: Multicast Registry, Cache Aside
- ✅ Migrations: Dual-write pattern
- ✅ Observability: Full OpenTelemetry integration
- ✅ Security: OIDC authentication, RBAC authorization
- ❌ Advanced patterns (Saga, Event Sourcing): Not yet
Target Users: 30% of internal services (~50 services)
Success Criteria:
- 100+ namespaces in production
- 500K+ RPS sustained
- 99.9% availability (month-over-month)
- 1 successful migration (Redis → DynamoDB)
Marketing:
- Internal tech talks (bi-weekly)
- Comprehensive documentation site
- Getting started guides and templates
- Office hours (weekly)
Phase 3: GA Release (Q4 2026)
Goal: General availability for all internal teams
Scope:
- ✅ All Phase 2 features
- ✅ Layer 2 Patterns: Saga, Event Sourcing, Work Queue
- ✅ Backends: All planned backends (8 total)
- ✅ Advanced migrations: Shadow traffic, percentage cutover
- ✅ Self-service: Namespace creation via GitOps
Target Users: 80% of internal services (~200 services)
Success Criteria:
- 500+ namespaces in production
- 5M+ RPS sustained
- 99.99% availability (quarterly)
- 3 successful migrations
Support:
- SLA-backed support (8x5 initially, 24x7 by Q1 2027)
- Dedicated Slack channel
- Runbook for common issues
- Incident response plan
Phase 4: Ecosystem Growth (2027+)
Goal: Become the default data access layer
Scope:
- ✅ External backends: AWS (DynamoDB, S3, SQS), GCP (Datastore, Pub/Sub)
- ✅ Community patterns: 3rd-party contributed patterns
- ✅ Client SDKs: Java, TypeScript, C#
- ✅ Integrations: Kubernetes Operator, Terraform Provider, Helm Charts
Target Users: 100% of internal services + select external partners
Success Criteria:
- 1,000+ namespaces
- 10M+ RPS sustained
- 99.99% availability (SLA-backed)
- 5+ community-contributed backend plugins
- 10+ community-contributed patterns
Ecosystem:
- Open-source core proxy
- Plugin marketplace
- Pattern certification program
- Annual user conference
Risks and Mitigations
Risk 1: Adoption Resistance (High)
Risk: Teams prefer using backends directly (fear of abstraction overhead)
Mitigation:
- ✅ Prove performance: Publish benchmarks showing <1ms overhead
- ✅ Early wins: Work with friendly teams, showcase success stories
- ✅ Incremental adoption: Allow hybrid (some namespaces via Prism, some direct)
- ✅ Developer experience: Make Prism easier than direct integration (generators, templates)
Ownership: Product Manager + Developer Relations
Risk 2: Performance Bottleneck (Medium)
Risk: Proxy becomes bottleneck at scale (CPU, memory, network)
Mitigation:
- ✅ Rust performance: Leverage zero-cost abstractions, async runtime
- ✅ Benchmarking: Continuous performance regression testing
- ✅ Horizontal scaling: Stateless proxy, easy to scale out
- ✅ Bypass mode: Critical paths can bypass proxy if needed
Ownership: Performance Engineer + SRE
Risk 3: Backend-Specific Features (Medium)
Risk: Teams need backend-specific features not abstracted by Prism
Mitigation:
- ✅ Layer 1 escape hatch: Low-level primitives allow direct control
- ✅ Backend-specific extensions: Optional proto extensions per backend
- ✅ Passthrough mode: Raw query mode for specialized cases
- ✅ Feedback loop: Prioritize frequently requested features
Ownership: Platform Engineer + Product Manager
Risk 4: Migration Complexity (High)
Risk: Dual-write and shadow traffic patterns introduce data consistency issues
Mitigation:
- ✅ Consistency validation: Automated diff detection and alerting
- ✅ Rollback plan: Instant rollback on error rate spike
- ✅ Gradual rollout: Percentage cutover (1% → 10% → 50% → 100%)
- ✅ Dry-run mode: Test migration without impacting production
Ownership: SRE + Database Engineer
Risk 5: Operational Complexity (Medium)
Risk: Prism adds another component to debug, increasing operational burden
Mitigation:
- ✅ Centralized observability: All signals (metrics, traces, logs) in one place
- ✅ Health checks: Automated detection and remediation
- ✅ Runbooks: Comprehensive troubleshooting guides
- ✅ Self-healing: Automatic restarts, circuit breaking, load shedding
Ownership: SRE + DevOps
Open Questions
Question 1: Should Layer 2 Patterns Be Open-Sourced?
Context: Layer 1 (primitives) are generic and reusable. Layer 2 (patterns) may encode internal business logic.
Options:
- Option A: Open-source all patterns (maximum community value)
- Option B: Open-source generic patterns only (Multicast Registry, Saga), keep business-specific private
- Option C: All patterns internal initially, evaluate open-source later
Recommendation: Option B (selective open-source) - generic patterns have broad applicability, while business-specific patterns stay internal
Decision Needed By: Q2 2026 (before Beta release)
Question 2: What is the Pricing Model (If External)?
Context: If Prism is offered as managed service to external customers, what pricing model makes sense?
Options:
- Option A: RPS-based (per million requests)
- Option B: Namespace-based (per active namespace)
- Option C: Resource-based (CPU/memory allocation)
- Option D: Free tier + enterprise support
Recommendation: Start with internal-only (no pricing), evaluate external offering in 2027
Decision Needed By: Q4 2026 (if external offering considered)
Question 3: How Do We Handle Schema Evolution?
Context: Protobuf schemas will evolve (new fields, deprecated methods). How do we maintain compatibility?
Options:
- Option A: Strict versioning (v1, v2 incompatible)
- Option B: Backward-compatible only (always additive)
- Option C: API versioning per namespace (clients pin versions)
Recommendation: Option B + C hybrid (backward-compatible by default, namespaces can pin versions)
Decision Needed By: Q1 2026 (before Alpha)
Appendix
Competitive Landscape
Product | Approach | Strengths | Weaknesses | Differentiation |
---|---|---|---|---|
Netflix Data Gateway | JVM-based proxy | Battle-tested at scale | Proprietary, JVM overhead | Rust performance, local-first testing |
AWS AppSync | Managed GraphQL | Serverless, fully managed | AWS-only, GraphQL-specific | Multi-cloud, gRPC/HTTP APIs |
Hasura | GraphQL over Postgres | Instant GraphQL API | Postgres-only initially | Multi-backend, pattern library |
Kong / Envoy | API Gateway | HTTP/gRPC proxy | No data abstraction | Data-aware patterns (not just routing) |
Direct SDK | Client libraries | No additional hop | Tight coupling, hard to migrate | Loose coupling, easy migrations |
Prism's Unique Value:
- Performance: Rust-based, 10-100x better than JVM alternatives
- Flexibility: Works with any backend (not locked to AWS/Postgres)
- Patterns: High-level abstractions (not just API gateway)
- Local-First: Full stack runs on laptop (not just cloud)
References
Netflix Data Gateway:
- Netflix Index - Overview and key learnings
- Netflix Summary - Lessons learned
- Netflix Abstractions - Data models (KeyValue, TimeSeries, Counter, WAL)
- Netflix Key Use Cases - Real-world applications
Prism Architecture:
- ADR-001: Rust for Proxy - Why Rust over Go/Java
- RFC-008: Proxy Plugin Architecture - Backend plugin system
- RFC-014: Layered Data Access Patterns - Layer 1 primitives
- RFC-017: Multicast Registry Pattern - First Layer 2 pattern
- RFC-018: POC Implementation Strategy - Phased rollout plan
Design Philosophy:
- MEMO-003: Documentation-First Development - Design before code
- MEMO-004: Backend Plugin Implementation Guide - Backend priorities
- MEMO-005: Client Protocol Design Philosophy - Layered API architecture
- MEMO-006: Backend Interface Decomposition - Schema registry
Revision History
- 2025-10-12: Initial PRD based on Netflix learnings and Prism architecture memos
- Future: Updates as product evolves
Approvals
Product Owner: [Name] - Approved [Date]
Engineering Lead: [Name] - Approved [Date]
Architecture Review: [Name] - Approved [Date]