RFC-030: Schema Evolution and Validation for Decoupled Pub/Sub
Abstract
This RFC addresses schema evolution and validation for publisher/consumer patterns in Prism, where producers and consumers are decoupled across teams that work asynchronously in separate GitHub repositories with different workflows. It proposes a schema registry approach that lets producers declare the schemas they publish (in GitHub or a dedicated registry), lets consumers validate compatibility at runtime, and lets platform teams enforce governance while maintaining development velocity.
Motivation
The Decoupling Problem
Prism's pub/sub and queue patterns intentionally decouple producers from consumers:
Current Architecture:
┌─────────────────┐ ┌─────────────────┐
│ Producer App │ │ Consumer App │
│ (Team A, Repo 1) │ │ (Team B, Repo 2)│
└────────┬────────┘ └────────┬────────┘
│ │
│ Publish │ Subscribe
│ events │ events
└───────────┐ ┌─────────┘
▼ ▼
┌──────────────────┐
│ Prism Proxy │
│ NATS/Kafka │
└──────────────────┘
Problems This Creates:
- Schema Discovery: Consumer teams don't know what schema producers use
  - No centralized documentation
  - Tribal knowledge or Slack asks: "Hey, what fields does `user.created` have?"
  - Breaking changes discovered at runtime
- Version Mismatches: Producer evolves schema, consumer breaks
  - Producer adds required field → consumers crash on deserialization
  - Producer removes field → consumers get `null` unexpectedly
  - Producer changes field type → silent data corruption
- Cross-Repo Workflows: Teams can't coordinate deploys
  - Producer Team A deploys v2 schema on Monday
  - Consumer Team B is still running v1 code on Friday
  - No visibility into downstream breakage
- Testing Challenges: Consumers can't test against producer changes
  - Integration tests use mock data
  - Mocks drift from real schemas
  - Production is the first place incompatibility is detected
- Governance Vacuum: No platform control over data quality
  - No PII tagging enforcement
  - No backward compatibility checks
  - No schema approval workflows
Why This Matters for PRD-001 Goals
PRD-001 Core Goals This Blocks:
Goal | Blocked By | Impact |
---|---|---|
Accelerate Development | Waiting for schema docs from other teams | Delays feature delivery |
Enable Migrations | Can't validate consumers before backend change | Risky migrations |
Reduce Operational Cost | Runtime failures from schema mismatches | Incident toil |
Improve Reliability | Silent data corruption from type changes | Data quality issues |
Foster Innovation | Fear of breaking downstream consumers | Slows experimentation |
Real-World Scenarios
Scenario 1: E-Commerce Order Events
Producer: Order Service (Team A)
- Publishes: orders.created
- Schema: {order_id, user_id, items[], total, currency}
Consumers:
- Fulfillment Service (Team B): Needs order_id, items[]
- Analytics Pipeline (Team C): Needs all fields
- Email Service (Team D): Needs order_id, user_id, total
Problem: Team A wants to add `tax_amount` field (required)
- How do they know which consumers will break?
- How do consumers discover this change before deploy?
- What happens if Team D deploys before Team A?
Scenario 2: IoT Sensor Data
Producer: IoT Gateway (Team A)
- Publishes: sensor.readings
- Schema: {sensor_id, timestamp, temperature, humidity}
Consumers:
- Alerting Service (Team B): Needs sensor_id, temperature
- Data Lake (Team C): Needs all fields
- Dashboard (Team D): Needs sensor_id, timestamp, temperature
Problem: Team A changes `temperature` from int (Celsius) to float (Fahrenheit)
- Type change breaks deserialization
- Semantic change breaks business logic
- How to test this without breaking production?
Scenario 3: User Profile Updates
Producer: User Service (Team A)
- Publishes: user.profile.updated
- Schema: {user_id, email, name, avatar_url}
- Contains PII: email, name
Consumer: Search Indexer (Team B)
- Stores ALL fields in Elasticsearch (public-facing search)
Problem: PII leak due to missing governance
- Producer doesn't tag PII fields
- Consumer indexes email addresses
- Compliance violation, data breach risk
Goals
- Schema Discovery: Consumers can find producer schemas without asking humans
- Compatibility Validation: Consumers detect breaking changes before deploy
- Decoupled Evolution: Producers evolve schemas without coordinating deploys
- Testing Support: Consumers test against real schemas in CI/CD
- Governance Enforcement: Platform enforces PII tagging, compatibility rules
- Developer Velocity: Schema changes take minutes, not days of coordination
Non-Goals
- Runtime Schema Transformation: No automatic v1 → v2 translation (use separate topics)
- Cross-Language Type System: Won't solve Go struct ↔ Python dict ↔ Rust enum mapping
- Schema Inference: Won't auto-generate schemas from published data
- Global Schema Uniqueness: Same event type can have different schemas per namespace
- Zero Downtime Schema Migration: Producers/consumers must handle overlapping schema versions
Proposed Solution: Layered Schema Registry
Architecture Overview
┌────────────────────────────────────────────────────────────┐
│ Producer Workflow │
├────────────────────────────────────────────────────────────┤
│ │
│ 1. Define Schema (protobuf/json-schema/avro) │
│ ├─ orders.created.v2.proto │
│ ├─ PII tags: @prism.pii(type="email") │
│ └─ Backward compat: optional new fields │
│ │
│ 2. Register Schema │
│ ├─ Option A: Push to GitHub (git tag release) │
│ ├─ Option B: POST to Prism Schema Registry │
│ └─ CI/CD validates compat │
│ │
│ 3. Publish with Schema Reference │
│ client.publish(topic="orders.created", payload=data, │
│ schema_url="github.com/.../v2.proto") │
│ │
└────────────────────────────────────────────────────────────┘
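The publish step above can be sketched as a plain function that wraps the payload with schema metadata before it reaches the proxy. The envelope field names are illustrative assumptions for this RFC, not Prism's actual wire format:

```python
import json

def build_envelope(topic: str, payload: dict, schema_url: str, schema_version: str) -> dict:
    """Attach schema metadata to an outgoing message (illustrative envelope,
    not Prism's actual wire format)."""
    return {
        "topic": topic,
        "schema_url": schema_url,
        "schema_version": schema_version,
        "payload": json.dumps(payload),
    }

envelope = build_envelope(
    topic="orders.created",
    payload={"order_id": "o-123", "user_id": "u-9", "total": 42.50, "currency": "USD"},
    schema_url="github.com/myorg/my-service/blob/v2.1.0/schemas/events/orders.created.v2.proto",
    schema_version="v2",
)
```

Carrying the schema URL on every message is what lets the proxy validate the payload and attach schema metadata without a lookup table of its own.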
┌────────────────────────────────────────────────────────────┐
│ Consumer Workflow │
├────────────────────────────────────────────────────────────┤
│ │
│ 1. Discover Schema │
│ ├─ List available schemas for topic │
│ ├─ GET github.com/.../orders.created.v2.proto │
│ └─ Generate client code (protoc) │
│ │
│ 2. Validate Compatibility (CI/CD) │
│ ├─ prism schema check --consumer my-schema.proto │
│ ├─ Fails if producer added required fields │
│ └─ Warns if producer removed fields │
│ │
│ 3. Subscribe with Schema Assertion │
│ client.subscribe(topic="orders.created", │
│ expected_schema="v2", │
│ on_mismatch="warn") │
│ │
└────────────────────────────────────────────────────────────┘
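Step 3's `on_mismatch` policy can be made concrete with a small checker; the function and its behavior are a sketch of the intended semantics, not an existing client API:

```python
def assert_schema(expected: str, actual: str, on_mismatch: str = "warn") -> bool:
    """Apply the on_mismatch policy from the subscribe call above:
    'warn' logs and continues, 'error' rejects the message."""
    if actual == expected:
        return True
    if on_mismatch == "error":
        raise ValueError(f"schema mismatch: expected {expected}, got {actual}")
    print(f"warning: expected schema {expected}, message carries {actual}")
    return False
```

The "warn" default keeps consumers running through a producer's rollout window; "error" suits consumers that would corrupt data if fed the wrong version.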
┌────────────────────────────────────────────────────────────┐
│ Prism Proxy (Schema Enforcement) │
├────────────────────────────────────────────────────────────┤
│ │
│ - Caches schemas from registry/GitHub │
│ - Validates published messages match declared schema │
│ - Attaches schema metadata to messages │
│ - Enforces PII tagging policy │
│ - Tracks schema versions per topic │
│ │
└────────────────────────────────────────────────────────────┘
Three-Tier Schema Storage
Tier 1: GitHub (Developer-Friendly, Git-Native)
Use Case: Open-source workflows, multi-repo teams, audit trail via Git history
# Producer repository structure
my-service/
├── schemas/
│ └── events/
│ ├── orders.created.v1.proto
│ ├── orders.created.v2.proto
│ └── orders.updated.v1.proto
├── prism-config.yaml
└── README.md
# prism-config.yaml
namespaces:
- name: orders
pattern: pubsub
schema:
registry_type: github
repository: github.com/myorg/my-service
path: schemas/events
branch: main # or use git tags for immutability
Schema URL Format:
github.com/myorg/my-service/blob/main/schemas/events/orders.created.v2.proto
github.com/myorg/my-service/blob/v2.1.0/schemas/events/orders.created.v2.proto # Tagged release
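Fetching one of these URLs programmatically requires translating the human-facing blob URL into GitHub's raw-content host, which serves the file body directly. A minimal helper:

```python
def raw_url(blob_url: str) -> str:
    """Translate a github.com blob URL into its raw.githubusercontent.com
    equivalent, which returns the file contents directly."""
    return ("https://"
            + blob_url.replace("github.com/", "raw.githubusercontent.com/", 1)
                      .replace("/blob/", "/", 1))

print(raw_url("github.com/myorg/my-service/blob/v2.1.0/schemas/events/orders.created.v2.proto"))
# → https://raw.githubusercontent.com/myorg/my-service/v2.1.0/schemas/events/orders.created.v2.proto
```

Note that raw fetches count against the same authenticated rate limits mentioned below, which is one reason the proxy caches schemas.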
Pros:
- ✅ Familiar Git workflow (PR reviews, version tags)
- ✅ Public schemas for open-source projects
- ✅ Free (GitHub hosts)
- ✅ Change history and blame
- ✅ CI/CD integration via GitHub Actions
Cons:
- ❌ Requires GitHub access (not suitable for air-gapped envs)
- ❌ Rate limits (5000 req/hour authenticated)
- ❌ Latency (300-500ms per fetch)
Tier 2: Prism Schema Registry (Platform-Managed, High Performance)
Use Case: Enterprise, high-throughput, governance controls, private networks
# POST /v1/schemas
POST https://prism-registry.example.com/v1/schemas
{
"namespace": "orders",
"topic": "orders.created",
"version": "v2",
"format": "protobuf",
"schema": "<base64-encoded proto>",
"metadata": {
"owner_team": "order-team",
"pii_fields": ["email", "billing_address"],
"compatibility_mode": "backward"
}
}
# Response
{
"schema_id": "schema-abc123",
"schema_url": "prism-registry.example.com/v1/schemas/schema-abc123",
"validation": {
"compatible_with_v1": true,
"breaking_changes": [],
"warnings": ["Field 'tax_amount' added as optional"]
}
}
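Producers (or their CI pipelines) would build the request body above before POSTing it; a sketch of that step, using only the endpoint shape and field names proposed in this RFC:

```python
import base64
import json

def registration_body(namespace: str, topic: str, version: str,
                      proto_text: str, owner_team: str,
                      pii_fields: list, mode: str = "backward") -> str:
    """Build the JSON body for POST /v1/schemas, base64-encoding the proto
    source as the endpoint above expects."""
    return json.dumps({
        "namespace": namespace,
        "topic": topic,
        "version": version,
        "format": "protobuf",
        "schema": base64.b64encode(proto_text.encode("utf-8")).decode("ascii"),
        "metadata": {
            "owner_team": owner_team,
            "pii_fields": pii_fields,
            "compatibility_mode": mode,
        },
    })

body = registration_body(
    "orders", "orders.created", "v2",
    'syntax = "proto3"; message OrderCreated { string order_id = 1; }',
    "order-team", ["email", "billing_address"],
)
```

Base64-encoding the schema keeps the proto source opaque to JSON escaping; the registry decodes it before running compatibility checks.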
Pros:
- ✅ Low latency (<10ms, in-cluster)
- ✅ No external dependencies
- ✅ Governance hooks (approval workflows)
- ✅ Caching (aggressive, TTL=1h)
- ✅ Observability (metrics, audit logs)
Cons:
- ❌ Requires infrastructure (deploy + maintain registry service)
- ❌ Not Git-native (must integrate with Git repos separately)
Tier 3: Confluent Schema Registry (Kafka-Native)
Use Case: Kafka-heavy deployments, existing Confluent infrastructure
# Use Confluent REST API
POST http://kafka-schema-registry:8081/subjects/orders.created-value/versions
{
"schema": "{...protobuf IDL...}",
"schemaType": "PROTOBUF"
}
# Prism adapter translates to Confluent API
prism-config.yaml:
schema:
registry_type: confluent
url: http://kafka-schema-registry:8081
compatibility: BACKWARD
Pros:
- ✅ Kafka ecosystem integration
- ✅ Mature, battle-tested (100k+ deployments)
- ✅ Built-in compatibility checks
Cons:
- ❌ Kafka-specific (doesn't work with NATS)
- ❌ Licensing (Confluent Community vs Enterprise)
- ❌ Heavy (JVM-based, 1GB+ memory)
Comparison with Kafka Ecosystem Registries
Validation Against Existing Standards:
Prism's schema registry approach is validated against three major Kafka ecosystem registries:
Feature | Confluent Schema Registry | AWS Glue Schema Registry | Apicurio Registry | Prism Schema Registry |
---|---|---|---|---|
Protocol Support | REST | REST | REST | gRPC + REST |
Schema Formats | Avro, Protobuf, JSON Schema | Avro, JSON Schema, Protobuf | Avro, Protobuf, JSON, OpenAPI, AsyncAPI | Protobuf, JSON Schema, Avro |
Backend Lock-In | Kafka-specific | AWS-specific | Multi-backend | Multi-backend (NATS, Kafka, etc.) |
Compatibility Checking | ✅ Backward, Forward, Full | ✅ Backward, Forward, Full, None | ✅ Backward, Forward, Full | ✅ Backward, Forward, Full, None |
Schema Evolution | ✅ Subject-based versioning | ✅ Version-based | ✅ Artifact-based | ✅ Topic + namespace versioning |
Language-agnostic | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes |
Storage Backend | Kafka topic | DynamoDB | PostgreSQL, Kafka, Infinispan | SQLite (dev), Postgres (prod) |
Git Integration | ❌ No | ❌ No | ⚠️ External only | ✅ Native GitHub support |
Client-Side Caching | ⚠️ Manual | ⚠️ Manual | ⚠️ Manual | ✅ Built-in (namespace config) |
PII Governance | ❌ No | ❌ No | ❌ No | ✅ Prism annotations |
Deployment | JVM (1GB+) | Managed service | JVM or native | Rust (<50MB) |
Latency (P99) | 10-20ms | 20-50ms | 10-30ms | <10ms (in-cluster) |
Pricing | Free (OSS) / Enterprise $$ | Per API call | Free (OSS) | Free (OSS) |
Key Differentiators:
- Multi-Backend Support: Prism works with NATS, Kafka, RabbitMQ, etc. (not Kafka-specific)
- Git-Native: Schemas can live in GitHub repos (no separate registry infrastructure for OSS)
- Config-Time Resolution: Schema validated once at namespace config, not per-message
- PII Governance: Built-in `@prism.pii` annotations for compliance
- Lightweight: Rust-based registry (50MB) vs JVM-based (1GB+)
Standard Compatibility:
Prism implements the same compatibility modes as Confluent:
- BACKWARD: New schema can read old data (add optional fields)
- FORWARD: Old schema can read new data (delete optional fields)
- FULL: Both backward and forward
- NONE: No compatibility checks
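The BACKWARD rule above can be made concrete with a small checker. The field representation here (name → (type, required)) is a deliberate simplification of real protobuf/Avro descriptors:

```python
def backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    """BACKWARD check: can a reader using the new schema read data written
    with the old schema? Fields map name -> (type, required)."""
    for name, (ftype, required) in new_fields.items():
        if name not in old_fields:
            if required:
                return False  # new required field is absent in old data
        elif old_fields[name][0] != ftype:
            return False  # type change breaks deserialization
    return True

# The tax_amount example from Scenario 1:
v1 = {"order_id": ("string", True), "total": ("double", True)}
v2_ok = dict(v1, tax_amount=("double", False))   # added as optional: compatible
v2_bad = dict(v1, tax_amount=("double", True))   # added as required: breaking
```

FORWARD is the mirror image (swap the roles of old and new), and FULL is the conjunction of both checks.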
Prism can also interoperate with Confluent Schema Registry via Tier 3 adapter (see above).
Build vs Buy: Custom Prism Schema Registry Feasibility Analysis
CRITICAL DECISION: Should Prism build its own schema registry or rely on existing solutions?
Decision Criteria:
Criterion | Custom Prism Registry | Existing Solutions (Confluent, Apicurio) | Weight |
---|---|---|---|
Multi-Backend Support | ✅ Works with NATS, Kafka, Redis, etc. | ⚠️ Kafka-specific (Confluent) or heavyweight (Apicurio) | HIGH |
Development Effort | ❌ 3-4 months initial + ongoing maintenance | ✅ Zero dev effort, use off-the-shelf | HIGH |
Deployment Complexity | ⚠️ Another service to deploy/monitor | ❌ Same burden, heavier (JVM-based, 1GB+ memory) | MEDIUM |
Performance | ✅ Rust-based (<50MB, <10ms P99) | ⚠️ JVM overhead (100ms+ P99 at scale) | MEDIUM |
Git Integration | ✅ Native GitHub support (Tier 1) | ❌ No native Git integration | HIGH |
PII Governance | ✅ Built-in @prism.pii annotations | ❌ Not supported (manual enforcement) | MEDIUM |
Operational Maturity | ❌ New, unproven at scale | ✅ Battle-tested (100k+ deployments) | HIGH |
Ecosystem Tools | ❌ No existing tooling |