RFC-053: Schematized Topic-Based Microservice Pipelines
Executive Summary
This RFC proposes support for schematized topic-based microservice pipelines, where data flows through a chain of Kafka topics with strongly-typed schemas at each stage. Prism will provide developer tooling to discover, debug, and optimize these workflows through:
- Pipeline Discovery & Visualization - Automatic topology detection from topic schemas and consumer group subscriptions
- Schema-Aware Observability - End-to-end tracing with schema validation at each pipeline stage
- Developer Productivity Tools - Local replay, diff tooling, and pipeline testing frameworks
- Production Debugging - Data lineage tracking, schema evolution management, and error root cause analysis
Problem Statement
Current Challenges with Topic-Based Pipelines
Event-driven microservice architectures rely on pipelines where:
- Services process data sequentially through Kafka topics
- Each stage reads from input topics, transforms data, writes to output topics
- Schemas evolve independently across services
- Debugging requires manual correlation across topics, consumer groups, and traces
Pain Points:
- Pipeline Visibility
  - No automatic discovery of topic relationships
  - Unclear which services consume from which topics
  - Documentation becomes stale
  - Topology difficult to understand
- Schema Management
  - Schema evolution breaks downstream consumers
  - No validation that output matches next stage input
  - Schema version compatibility unclear
  - Version mismatches cause runtime failures
- Debugging
  - Errors in stage N+1 caused by bad data from earlier stages
  - Manual message correlation across topics
  - No tooling to replay message flows
  - Production issues require extensive log analysis
- Developer Productivity
  - Local testing requires full infrastructure
  - No pre-deployment validation of downstream impact
  - Schema changes require manual team coordination
  - Integration testing requires entire pipeline
- Operations
  - Consumer lag at one stage cascades downstream
  - No unified pipeline health view
  - Bottleneck identification requires manual analysis
  - Inconsistent error handling across stages
Goals
Primary Goals
- Pipeline Discovery: Automatically detect and visualize topic-based pipelines from:
  - Kafka topic metadata and schema registry
  - Consumer group subscriptions and topic assignments
  - Producer/consumer configuration in Prism patterns
- Schema-Aware Tracing: Provide end-to-end observability with:
  - Message tracing across all pipeline stages
  - Automatic schema validation at each topic boundary
  - Schema evolution impact analysis
  - Data lineage tracking from source to sink
- Developer Tools: Enable productive local development with:
  - Pipeline replay from any stage with historical data
  - Schema diff tooling to identify breaking changes
  - Local pipeline testing with mock topics
  - Contract testing between pipeline stages
- Production Debugging: Simplify troubleshooting with:
  - Root cause analysis for data quality issues
  - Message flow visualization for specific correlation IDs
  - Schema mismatch detection and alerting
  - Pipeline health dashboards with stage-specific metrics
Non-Goals
- Building a general-purpose workflow orchestration engine (use Airflow/Temporal instead)
- Supporting non-Kafka messaging systems in initial version
- Replacing existing schema registry implementations
- Providing data transformation logic (Prism is infrastructure, not business logic)
- Real-time stream processing with complex joins (use Kafka Streams/Flink)
- Batch job orchestration (use Airflow for batch pipelines)
- Schema migration tooling (use schema registry features)
- Long-term data archival (use dedicated data lake solutions)
Proposed Solution
Architecture Overview
Prism will introduce a new Pipeline Pattern that extends existing Producer/Consumer patterns with:
┌────────────────────────────────────────────────────────────────┐
│                    Prism Pipeline Registry                     │
│  - Topology Discovery                                          │
│  - Schema Evolution Tracking                                   │
│  - Pipeline Metadata Store                                     │
└────────────────────────────────────────────────────────────────┘
                                │
          ┌─────────────────────┼─────────────────────┐
          │                     │                     │
  ┌───────▼───────┐     ┌───────▼───────┐     ┌───────▼───────┐
  │   Producer    │     │   Transform   │     │   Consumer    │
  │    Pattern    │     │    Pattern    │     │    Pattern    │
  │ +Pipeline Ext │     │   +Pipeline   │     │   +Pipeline   │
  └───────────────┘     └───────────────┘     └───────────────┘
          │                     │                     │
          │                     │                     │
    ┌─────▼────┐            ┌───▼────┐            ┌───▼────┐
    │ Topic A  │───────────▶│Topic B │───────────▶│Topic C │
    │ Schema 1 │            │Schema 2│            │Schema 3│
    └──────────┘            └────────┘            └────────┘
Key Components
1. Pipeline Discovery Service
Responsibilities:
- Scan Kafka cluster for topics and consumer groups
- Query schema registry for topic schemas
- Correlate producers/consumers using Prism pattern metadata
- Build directed acyclic graph (DAG) of topic relationships
- Detect cycles and orphaned topics
Implementation:
message PipelineTopology {
  string pipeline_id = 1;
  repeated PipelineStage stages = 2;
  repeated TopicEdge edges = 3;
  map<string, SchemaVersion> schemas = 4;
  google.protobuf.Timestamp discovered_at = 5;
}

message PipelineStage {
  string stage_id = 1;
  string service_name = 2;
  repeated string input_topics = 3;
  repeated string output_topics = 4;
  string consumer_group = 5;
  PipelineStageMetadata metadata = 6;
}

message TopicEdge {
  string from_stage = 1;
  string to_stage = 2;
  string topic_name = 3;
  SchemaCompatibility compatibility = 4;
}
Discovery Algorithm:
- Enumerate topics with schema registry entries
- Identify subscribed topics per consumer group
- Link consumers to output topics via Prism pattern metadata
- Build directed graph: services as nodes, topics as edges
- Validate schema compatibility at each edge
- Detect cycles and orphaned topics
- Store topology in Pipeline Registry with timestamp (graph construction and cycle check sketched below)
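To make the graph-building steps concrete, here is a minimal Python sketch. It is illustrative only: the stage descriptors are assumed to have already been assembled from the consumer-group and pattern-metadata scans above, and schema-compatibility validation at each edge is omitted.

from collections import defaultdict

def build_topology(stages):
    """stages: dicts with 'stage_id', 'input_topics', 'output_topics'
    (assumed shape, assembled from cluster and pattern-metadata scans)."""
    producers = defaultdict(list)  # topic -> stages that write it
    consumers = defaultdict(list)  # topic -> stages that read it
    for s in stages:
        for t in s["output_topics"]:
            producers[t].append(s["stage_id"])
        for t in s["input_topics"]:
            consumers[t].append(s["stage_id"])

    # Edges: producer stage -> consumer stage, labeled with the topic.
    edges = [(p, c, t)
             for t in producers
             for p in producers[t]
             for c in consumers.get(t, [])]

    # Orphaned topics: written but never read, or read but never written.
    orphans = sorted(set(producers) ^ set(consumers))

    # Cycle detection with depth-first search over the stage graph.
    graph = defaultdict(list)
    for p, c, _ in edges:
        graph[p].append(c)
    state = {}  # stage_id -> "visiting" | "done"

    def visit(node):
        state[node] = "visiting"
        for nxt in graph[node]:
            if state.get(nxt) == "visiting":
                return True  # back edge found -> cycle
            if nxt not in state and visit(nxt):
                return True
        state[node] = "done"
        return False

    has_cycle = any(s["stage_id"] not in state and visit(s["stage_id"]) for s in stages)
    return {"edges": edges, "orphaned_topics": orphans, "has_cycle": has_cycle}

# Example over the user-signup flow described later in this RFC:
topo = build_topology([
    {"stage_id": "api-gateway", "input_topics": [], "output_topics": ["raw-signups"]},
    {"stage_id": "email-validator", "input_topics": ["raw-signups"], "output_topics": ["validated-signups"]},
])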
2. Schema-Aware Tracing
Extend existing message envelope with pipeline context:
message PipelineContext {
  string pipeline_id = 1;
  string stage_id = 2;
  int32 stage_sequence = 3;  // Position in pipeline (0-based)
  string correlation_id = 4; // End-to-end trace ID
  repeated SchemaValidation validations = 5;
  map<string, string> stage_metadata = 6;
  DataLineage lineage = 7;
}

message SchemaValidation {
  string stage_id = 1;
  string schema_name = 2;
  int32 schema_version = 3;
  bool validation_passed = 4;
  repeated string errors = 5;
  google.protobuf.Timestamp validated_at = 6;
}

message DataLineage {
  repeated DataSource sources = 1;
  repeated Transformation transformations = 2;
  repeated ModelVersion models = 3;
  DataQualityMetrics quality = 4;
  google.protobuf.Timestamp expires_at = 5;
  repeated string pii_tags = 6;
}

message DataSource {
  string source_id = 1;
  string source_type = 2;
  string source_location = 3;
  google.protobuf.Timestamp ingested_at = 4;
  map<string, string> metadata = 5;
}

message Transformation {
  string transform_id = 1;
  string transform_type = 2;
  string code_version = 3;
  map<string, double> metrics = 4;
}

message ModelVersion {
  string model_id = 1;
  string model_version = 2;
  string framework = 3;
  map<string, double> inference_metrics = 4;
}

message DataQualityMetrics {
  int64 records_processed = 1;
  int64 records_failed = 2;
  map<string, double> quality_scores = 3;
  repeated string anomalies = 4;
}
Tracing Flow (the stage-side validation step is sketched after this list):
- Producer creates message with initial pipeline_context
- Each stage validates input schema before processing
- Validation results appended to pipeline_context.validations
- Transformation metrics recorded in pipeline_context.lineage
- OpenTelemetry spans link stages via correlation_id
- Prism proxy logs validation failures with stage context
- Failed messages routed to DLQ with full lineage
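For illustration, a stage-side validation step might look like the sketch below. SchemaValidation mirrors the protobuf message above as a Python dataclass, and the required-field check stands in for a real registry-backed validator; both are assumptions, not the actual Prism proxy implementation.

from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class SchemaValidation:  # mirrors the protobuf message above
    stage_id: str
    schema_name: str
    schema_version: int
    validation_passed: bool
    errors: list[str] = field(default_factory=list)
    validated_at: str = ""

def validate_and_record(ctx: dict, stage_id: str, message: dict, schema: dict) -> None:
    """Validate `message` against `schema` and append the result to
    ctx["validations"] before the stage processes the payload."""
    errors = [f"missing required field: {f}"
              for f in schema["required"] if f not in message]
    ctx["validations"].append(SchemaValidation(
        stage_id=stage_id,
        schema_name=schema["name"],
        schema_version=schema["version"],
        validation_passed=not errors,
        errors=errors,
        validated_at=datetime.now(timezone.utc).isoformat(),
    ))
    if errors:
        # Route to the DLQ with full lineage rather than process bad data.
        raise ValueError(f"schema validation failed at {stage_id}: {errors}")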
Benefits:
- End-to-end message flow visibility
- Automatic schema mismatch detection
- Debugging context for production issues
- Data lineage for compliance (GDPR, SOC2)
3. Developer Tooling
A. Pipeline CLI (prismctl pipeline)
# Discover and visualize pipelines
prismctl pipeline discover --cluster=prod

# Show topology for specific pipeline
prismctl pipeline show user-signup-flow --format=dot
prismctl pipeline show user-signup-flow --format=mermaid

# Validate schema compatibility
prismctl pipeline validate user-signup-flow

# Replay messages for testing with partition awareness
prismctl pipeline replay \
  --pipeline=user-signup-flow \
  --from-stage=email-validator \
  --correlation-id=abc123 \
  --partition=2 \
  --offset-range=1000-2000 \
  --preserve-order=true \
  --rate-limit=1000 \
  --target=local

# Replay with sampling for large datasets
prismctl pipeline replay \
  --pipeline=ml-feature-pipeline \
  --from-stage=feature-transformer \
  --time-range="2025-11-06T00:00:00Z/2025-11-07T00:00:00Z" \
  --sample-rate=0.1 \
  --target=local

# Diff schemas between stages
prismctl pipeline schema-diff \
  --pipeline=user-signup-flow \
  --from-stage=api-gateway \
  --to-stage=email-validator

# Sample messages for debugging
prismctl pipeline sample \
  --pipeline=ml-feature-pipeline \
  --stage=feature-transformer \
  --filter="$.features.click_rate > 0.8" \
  --count=100 \
  --output=debug_samples.jsonl

# Compare outputs between versions
prismctl pipeline diff \
  --pipeline=ml-feature-pipeline \
  --stage=feature-transformer \
  --baseline-version=v1.2.0 \
  --comparison-version=v1.3.0 \
  --correlation-ids=sample.txt \
  --show-distribution-stats
B. Local Development Environment
# .prism/pipelines/user-signup.yml
pipeline:
  id: user-signup-flow
  stages:
    - id: api-gateway
      service: api-service
      output_topics:
        - name: raw-signups
          schema: SignupRequest
    - id: email-validator
      service: validator-service
      input_topics:
        - name: raw-signups
          schema: SignupRequest
      output_topics:
        - name: validated-signups
          schema: ValidatedSignup
    - id: user-creator
      service: user-service
      input_topics:
        - name: validated-signups
          schema: ValidatedSignup
      output_topics:
        - name: created-users
          schema: User
# Start local pipeline with mocked topics
prismctl pipeline dev user-signup-flow
C. Contract Testing Framework
# tests/test_pipeline_contracts.py
from prism_data.testing import PipelineContractTest


class TestUserSignupPipeline(PipelineContractTest):
    pipeline_id = "user-signup-flow"

    def test_email_validator_input_contract(self):
        """Ensure email-validator can consume from api-gateway output."""
        self.assert_schema_compatible(
            producer_stage="api-gateway",
            consumer_stage="email-validator",
            topic="raw-signups"
        )

    def test_end_to_end_message_flow(self):
        """Test complete pipeline with sample message."""
        result = self.send_message(
            stage="api-gateway",
            message={"email": "test@example.com", "name": "Test User"}
        )
        self.assert_message_reaches_stage(
            correlation_id=result.correlation_id,
            stage="user-creator",
            timeout_seconds=10
        )
        self.assert_schema_validations_passed(
            correlation_id=result.correlation_id
        )


class TestMLFeaturePipeline(PipelineContractTest):
    pipeline_id = "ml-feature-pipeline"

    def test_feature_transformer_data_quality(self):
        """Ensure feature transformer output meets quality thresholds."""
        result = self.send_message(
            stage="raw-data-ingestion",
            message={"user_id": 123, "clicks": 5, "timestamp": "2025-11-07"}
        )
        output = self.assert_message_reaches_stage(
            correlation_id=result.correlation_id,
            stage="feature-transformer",
            timeout_seconds=10
        )
        self.assert_data_quality(
            output,
            rules={
                "completeness": {"threshold": 0.95},
                "range_checks": {
                    "click_rate": {"min": 0.0, "max": 1.0},
                    "feature_count": {"min": 10, "max": 100}
                },
                "no_nulls": ["user_id", "feature_vector"],
                "distribution": {
                    "feature_vector": {"mean_range": [0, 10], "std_range": [0, 5]}
                }
            }
        )

    def test_schema_drift_detection(self):
        """Alert on schema drift in feature pipeline."""
        self.assert_no_schema_drift(
            stage="feature-transformer",
            baseline_time_range="2025-10-01/2025-10-31",
            comparison_time_range="2025-11-01/2025-11-07",
            max_drift_score=0.1
        )
4. Production Debugging
A. Pipeline Health Dashboard
Grafana dashboard with:
- Pipeline topology visualization (auto-generated from discovery)
- Per-stage metrics: throughput, latency, error rate
- Consumer lag across all stages
- Schema validation failure rate
- Correlation ID search and trace viewer
B. Root Cause Analysis
When errors occur at stage N, Prism automatically:
- Identifies messages that failed validation
- Traces correlation ID back through pipeline
- Highlights stage where data corruption occurred
- Shows schema diff between expected and actual
- Links to relevant service logs and traces (a root-cause lookup sketch follows the example below)
Example Scenario:
Stage 3 (user-creator) failing with schema validation errors:
┌─────────────────────────────────────────────────────────┐
│ Root Cause: email-validator (Stage 2)                   │
│                                                         │
│ Expected output schema: ValidatedSignup v2              │
│ Actual output schema:   ValidatedSignup v1              │
│                                                         │
│ Missing field: phone_number (required in v2)            │
│                                                         │
│ Action: Upgrade email-validator to produce v2 schema    │
│ Or:     Downgrade user-creator to accept v1 schema      │
└─────────────────────────────────────────────────────────┘
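Given the recorded validations for a correlation ID, identifying the root-cause stage reduces to finding the earliest failed validation. A minimal sketch follows; the entries mirror SchemaValidation joined with the stage's position in the pipeline (an assumed shape), and fetching them from the trace store is not shown.

def find_root_cause_stage(validations: list[dict]) -> dict | None:
    """Return the earliest stage whose output failed validation, or None."""
    for v in sorted(validations, key=lambda v: v["stage_sequence"]):
        if not v["validation_passed"]:
            return {
                "stage": v["stage_id"],
                "schema": f'{v["schema_name"]} v{v["schema_version"]}',
                "errors": v["errors"],
            }
    return None  # nothing failed validation; fall back to log/trace correlation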
C. ML-Specific Debugging
Feature Drift Detection (per-feature check sketched below):
- Statistical tests (Kolmogorov-Smirnov, chi-squared) on feature distributions
- Alert when distributions deviate beyond configurable threshold
- Time-series visualization of feature statistics
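A minimal sketch of the per-feature check using scipy's two-sample Kolmogorov-Smirnov test; how the baseline and current windows are sampled from the topic is assumed, and the alpha threshold is illustrative.

from scipy.stats import ks_2samp

def feature_drifted(baseline: list[float], current: list[float],
                    alpha: float = 0.01) -> tuple[bool, float]:
    """Compare one feature's distributions; report drift when p-value < alpha."""
    result = ks_2samp(baseline, current)
    return result.pvalue < alpha, result.statistic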
Model Performance Tracking:
- Track inference latency, throughput, prediction distributions per stage
- Correlate model version with performance metrics
- Automatic rollback trigger on regression beyond threshold
- Record model metadata in pipeline context lineage
Data Sampling for Debug:
- JSONPath filters for targeted message sampling
- Stratified sampling for representative debugging datasets
- Export to JSONL for offline analysis
Integration with Existing Patterns
Backward Compatibility:
- Existing Producer/Consumer patterns work unchanged
- Pipeline features opt-in via pattern configuration
- No breaking changes to protobuf message definitions
Pattern Extensions:
# Producer pattern with pipeline support
pattern: producer
pipeline:
  enabled: true
  pipeline_id: user-signup-flow
  stage_id: api-gateway
  output_topics:
    - name: raw-signups
      schema:
        registry: confluent
        subject: raw-signups-value
        version: latest

# Consumer pattern with pipeline support
pattern: consumer
pipeline:
  enabled: true
  pipeline_id: user-signup-flow
  stage_id: email-validator
  input_topics:
    - name: raw-signups
      schema:
        registry: confluent
        subject: raw-signups-value
        version: 2
  output_topics:
    - name: validated-signups
      schema:
        registry: confluent
        subject: validated-signups-value
        version: 1
Implementation Plan
Phase 1: Foundation (Weeks 1-4)
Deliverables:
- Pipeline topology protobuf definitions
- Pipeline Registry service (basic CRUD)
  - Dependency: Choose storage backend (SQLite vs Postgres - ADR needed)
  - Risk: Schema evolution for the registry itself
- Discovery service for Kafka cluster scanning
  - Dependency: Kafka Admin API rate limits - test at scale first
  - Risk: Large clusters (>5000 topics) may time out
- Schema compatibility validator
  - Dependency: Integration with Confluent Schema Registry AND AWS Glue
  - Risk: Different registries have incompatible APIs
- prismctl pipeline discover command
  - Load test discovery with synthetic topics
Success Criteria:
- Can scan production Kafka cluster with throttling
- Generate pipeline topology graph in <30s for 1000 topics
- Detect schema incompatibilities with 100% accuracy vs manual verification
- Load test passes with 10,000 synthetic topics
Phase 2: Developer Tooling (Weeks 5-8)
Deliverables:
- Pipeline CLI commands (show, validate, schema-diff, replay, sample)
- Local development environment support
- Pipeline visualization (DOT/Mermaid export)
- Contract testing framework with data quality validation
- Partition-aware replay with sampling support
Success Criteria:
- Developers can visualize and replay pipelines locally
- Schema diffs identify breaking changes before deployment
- Contract tests validate schema compatibility and data quality
- Replay supports partition targeting and rate limiting
Phase 3: Tracing & Observability (Weeks 9-12)
Deliverables:
- Pipeline context in message envelope with data lineage
- Schema validation at each stage
- OpenTelemetry integration with pipeline spans
- Grafana dashboard templates with topology visualization
- Correlation ID search with trace aggregation
- ML-specific metrics (feature drift, model performance)
Success Criteria:
- End-to-end trace spans link all pipeline stages
- Schema validation failures logged and alerted with root cause
- Dashboards show pipeline health, consumer lag, validation rates
- Data lineage tracked for ML compliance
Phase 4: Production Debugging (Weeks 13-16)
Deliverables:
- Root cause analysis with automatic stage identification
- Message replay with partition and offset control
- Schema evolution impact reports with migration plans
- Pipeline health alerting rules (lag, validation failures, drift)
- Feature drift detection for ML pipelines
- Dead-letter queue integration
Success Criteria:
- Reduce MTTR for pipeline issues
- Identify root cause stage for schema errors automatically
- Support partition-aware replay with sampling
- Detect and alert on feature drift for ML workloads
Success Metrics
Developer Productivity
- Onboarding Time: Time for new developers to understand pipeline topology
- Debug Time: Time to identify root cause of pipeline failures
- Test Coverage: Percentage of pipeline stages covered by contract tests
Operational Excellence
- MTTR: Mean time to resolution for pipeline issues
- Schema Errors: Production schema validation failures caught pre-deployment
- Pipeline Visibility: Production pipelines automatically discovered and visualized
Performance & Scale
- Discovery Throughput: Support clusters with 10,000+ topics without degradation
- Discovery Latency: Complete topology scan in <30s for 1,000 topics
- Trace Overhead: Pipeline context adds <5% message size overhead
- Replay Performance: Support replay at 10,000 msg/s for debugging
- Memory Footprint: Pipeline registry uses <1GB RAM for 100 active pipelines
Schema Evolution Strategy
Prism enforces schema compatibility levels per topic to prevent breaking changes; a compatibility-check sketch follows the level definitions below.
Compatibility Levels
BACKWARD (default): consumers using the new schema can read data written with the old schema
- Add optional fields: allowed
- Remove fields (optional or required): allowed
- Add required fields: forbidden
FORWARD: consumers using the old schema can read data written with the new schema
- Add fields (optional or required): allowed
- Remove optional fields: allowed
- Remove required fields: forbidden
FULL: backward and forward compatible
- Add or remove optional fields: allowed
- All other changes: forbidden
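A sketch of how these rules can be checked over simplified schema descriptors (sets of required field names, an assumed shape); a real implementation would delegate to the schema registry's own compatibility API.

def backward_violations(old: dict, new: dict) -> list[str]:
    """BACKWARD: consumers on `new` must still read data written with `old`,
    so the new schema may not add required fields."""
    return [f"added required field: {f}"
            for f in sorted(new["required"] - old["required"])]

def forward_violations(old: dict, new: dict) -> list[str]:
    """FORWARD: consumers on `old` must still read data written with `new`,
    so the new schema may not remove required fields."""
    return [f"removed required field: {f}"
            for f in sorted(old["required"] - new["required"])]

def full_violations(old: dict, new: dict) -> list[str]:
    """FULL: both directions must hold."""
    return backward_violations(old, new) + forward_violations(old, new)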
Breaking Change Workflow
When pre-deployment validation detects incompatibility:
- CI/CD fails with a schema diff showing the incompatible changes
- Developer chooses a migration strategy:
  - Dual-write: Write to both topic and topic-v2 during the transition
  - Blue-green: Create topic-v2, migrate consumers incrementally
  - Consumer upgrade first: Upgrade all consumers to support both schemas
- Execute migration: prismctl pipeline migrate --strategy=dual-write (a dual-write sketch follows this list)
- Monitor consumer lag and error rates on both topics
- Validate all consumers migrated successfully
- Deprecate the old topic after the configured transition period (default: 7 days)
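For illustration, the producer side of the dual-write strategy could look like the following sketch using the confluent-kafka client; the topic names and the two pre-serialized payloads are assumptions.

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def dual_write(key: bytes, v1_payload: bytes, v2_payload: bytes) -> None:
    """During the transition, write every record to both topics so old- and
    new-schema consumers can migrate independently."""
    producer.produce("validated-signups", key=key, value=v1_payload)     # schema v1
    producer.produce("validated-signups-v2", key=key, value=v2_payload)  # schema v2
    producer.flush()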
Implementation
- Add schema_compatibility_level to the PipelineStage protobuf
- Validate compatibility during prismctl pipeline validate
- Block deployments that violate compatibility rules
- Generate migration plan with rollback steps
Failure Scenarios & Recovery
Scenario 1: Schema Registry Unavailable
Impact: Cannot validate schemas, discovery fails
Detection: Health check fails for schema registry client
Recovery:
- Fall back to cached schemas (5-minute TTL)
- Alert on-call engineer
- Gracefully degrade: Allow pipeline to proceed with warning
- Log all unvalidated messages for post-incident validation (cache-fallback sketch below)
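A sketch of the cached-schema fallback; the registry client's get_latest method is hypothetical, and the TTL matches the recovery steps above.

import time

_schema_cache: dict[str, tuple[float, dict]] = {}
TTL_SECONDS = 300  # 5-minute TTL, per the recovery steps above

def get_schema_with_fallback(subject: str, registry) -> dict | None:
    now = time.monotonic()
    try:
        schema = registry.get_latest(subject)  # hypothetical client call
        _schema_cache[subject] = (now, schema)
        return schema
    except ConnectionError:
        cached = _schema_cache.get(subject)
        if cached and now - cached[0] < TTL_SECONDS:
            return cached[1]  # degrade gracefully with the cached schema
        return None  # proceed unvalidated; log the message for later checks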
Scenario 2: Consumer Lag Exceeds Threshold
Impact: Downstream stages starved, end-to-end latency increases
Detection: Consumer group lag exceeds configured threshold
Recovery:
- prismctl pipeline diagnose identifies the bottleneck stage
- Auto-scale consumer instances if enabled
- Trigger backpressure to slow producers
- Alert if lag persists beyond threshold
Scenario 3: Schema Validation Failure Rate Exceeds Threshold
Impact: Data quality issues, downstream failures
Detection: Validation failure rate metric crosses threshold
Recovery:
- Identify failing stage via dashboard
- Automatically pause consumer at failing stage
- Quarantine failed messages to dead-letter queue (DLQ)
- Root cause analysis links to recent deployments
- Automated rollback if correlated with recent schema change
Scenario 4: Pipeline Topology Drift
Impact: Discovered topology doesn't match declared configuration
Detection: Topology diff between registry and actual cluster state
Recovery:
- Alert indicates drift detected
- prismctl pipeline reconcile shows the differences
- Admin chooses: accept drift (update registry) or fix cluster
- Audit log tracks all topology changes
Multi-Tenancy & Security
Namespace Isolation
Pipelines are scoped to Prism namespaces:
- Prefix all topic names with the namespace: {namespace}.{topic_name}
- Pipeline CLI respects namespace permissions
- Cross-namespace pipelines require explicit authorization
Sensitive Data Handling
PII Detection: Automatically tag PII fields in schemas
Data Masking: Support redaction in replay and sampling (CLI example and a redaction sketch below)
prismctl pipeline replay \
  --pipeline=user-signup-flow \
  --mask-fields="email,phone,ssn" \
  --target=local
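Under the hood, redaction for flat records can be as simple as the following sketch; the field names mirror the flag above, and nested-path matching is not shown.

def mask_fields(record: dict, fields: set[str], placeholder: str = "***") -> dict:
    """Replace sensitive top-level fields with a placeholder before replay."""
    return {k: (placeholder if k in fields else v) for k, v in record.items()}

# Example, mirroring --mask-fields="email,phone,ssn":
masked = mask_fields({"email": "a@example.com", "name": "Ada"},
                     {"email", "phone", "ssn"})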
Audit Logging
- Log all pipeline topology changes with who/what/when
- Track message replay operations for forensics
- Integrate with compliance reporting (SOC2, GDPR)
Encryption
- Schema registry credentials encrypted at rest in Pipeline Registry
- Kafka credentials managed via Vault integration (ADR-028)
- TLS required for all inter-service communication
Cost & Resource Planning
Storage Costs
- Pipeline Registry: ~10MB per pipeline topology (protobuf compressed)
- Trace Data: at 1M msgs/day with ~100 bytes of trace context per message and 7-day retention (arithmetic sketched below):
  - Uncompressed: ~700MB
  - Compressed (zstd, ~80% reduction): ~140MB
  - Sampling strategy (100% of validation failures, 1% of successes): ~7-8MB
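The sizing above is plain arithmetic over the stated assumptions; the failure rate used in the sampling estimate is itself an assumption.

msgs_per_day = 1_000_000
bytes_per_msg = 100          # trace context per message (stated assumption)
retention_days = 7
failure_rate = 0.001         # assumed for the sampling estimate

uncompressed = msgs_per_day * bytes_per_msg * retention_days         # 700 MB
compressed = uncompressed * 0.20                                     # ~140 MB at 80% reduction
sampled = uncompressed * (failure_rate + (1 - failure_rate) * 0.01)  # ~7.7 MB

print(f"{uncompressed / 1e6:.0f} MB, {compressed / 1e6:.0f} MB, {sampled / 1e6:.1f} MB")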
Compute Costs
- Discovery Service: 1 CPU core, 512MB RAM (continuous scanning)
- Schema Validation: ~0.1-1ms CPU per message depending on schema complexity
- Replay Service: Scale linearly with replay throughput (provision on-demand)
- Estimated compute: $30-50/month for 100 pipelines (t3.medium + Lambda)
Network Costs
- Admin API Calls: Discovery scans = ~1000 calls/5min = 288k calls/day
  - Mitigation: Cache topology (5-minute TTL), incremental scanning
- Cross-AZ Traffic: Pipelines spanning AZs = $0.01/GB egress
  - Mitigation: Same-AZ routing, replica placement awareness
- Estimated network: $5-20/month depending on cluster topology
Adoption Path & Migration
For New Pipelines
- Define the pipeline in .prism/pipelines/*.yml
- Add a pipeline section to Producer/Consumer patterns
- Run prismctl pipeline validate in CI/CD
- Deploy with automatic topology registration
For Existing Pipelines
- Discovery Phase: Run prismctl pipeline discover to map existing topology
- Annotation Phase: Add pipeline metadata to patterns incrementally
- Validation Phase: Enable schema validation (warning mode first)
- Enforcement Phase: Block deployments on validation failures
Pilot Project Recommendations
- Start with: Non-critical pipeline with 3-5 stages, <10k msg/s
- Avoid: High-throughput (>100k msg/s) or mission-critical pipelines initially
- Measure:
  - MTTR for debugging pipeline issues
  - Time to onboard a new developer to the pipeline
  - Number of schema-related production incidents
  - Developer satisfaction survey
- Timeline: 2-week pilot, 2-week retrospective with team feedback
- Success criteria: Positive developer feedback and no production incidents caused by pipeline features
Risks & Mitigations
Risk 1: Discovery Performance Impact
Risk: Scanning large clusters (5000+ topics) may impact cluster performance or timeout.
Mitigation:
- Use Kafka Admin API with rate limiting (10 requests/second)
- Cache topology with configurable TTL (default: 5 minutes)
- Incremental discovery: only scan topics modified since last scan
- Manual topology registration as fallback for large clusters
- Offload discovery to dedicated Kafka cluster replica if available
Risk 2: Schema Registry Dependency
Risk: Tight coupling to specific schema registry implementations (Confluent, AWS Glue).
Mitigation:
- Abstract schema registry behind Prism interface
- Support multiple registry providers via plugins
- Allow manual schema definition as fallback
- Maintain local schema cache for offline development
Risk 3: Message Overhead
Risk: Pipeline context adds 100-500 bytes per message (overhead varies with lineage depth).
Mitigation:
- Make pipeline context optional (opt-in per pattern)
- Use efficient protobuf encoding
- Compress validation histories when >10 validations
- Support Kafka header-based context (not in message body) to reduce payload size
- Prune old validation entries after a configurable stage count (default: 10 stages; sketched below)
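The pruning mitigation is a one-liner in spirit; a sketch, assuming validations are held as a list on the pipeline context.

MAX_VALIDATION_ENTRIES = 10  # default stage count from the list above

def prune_pipeline_context(ctx: dict) -> dict:
    """Keep only the most recent validation entries to bound message overhead."""
    ctx["validations"] = ctx["validations"][-MAX_VALIDATION_ENTRIES:]
    return ctx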
Risk 4: Adoption Resistance
Risk: Teams resist adding pipeline metadata to existing services.
Mitigation:
- Demonstrate value with pilot project (quantify MTTR reduction)
- Provide automated migration tooling for existing pipelines
- Support gradual rollout: works with partial pipeline metadata
- Discovery works without manual metadata (infers from Kafka consumer groups)
- Create clear documentation with real-world examples
- Provide optional features: teams enable schema validation when ready
Alternatives Considered
Alternative 1: Use Existing Tools (Kafka Streams, ksqlDB)
Pros:
- Mature ecosystem with wide adoption
- Built-in topology management
- Strong community support
Cons:
- Requires rewriting services in Kafka Streams
- Limited observability for custom microservices
- No schema-aware debugging tooling
- Doesn't integrate with existing Prism patterns
Decision: Build on Prism to leverage existing patterns and provide deeper integration.
Alternative 2: Schema Registry as Single Source of Truth
Pros:
- Schema registry already tracks topic schemas
- No additional metadata storage needed
Cons:
- Schema registry doesn't track consumer relationships
- Can't represent pipeline topology
- Limited to schema-level metadata
- Doesn't support custom pipeline stages
Decision: Use schema registry as data source but maintain pipeline topology separately.
Alternative 3: Kafka Connector-Based Approach
Pros:
- Kafka Connect has built-in topology concepts
- Integrates with existing connector ecosystem
Cons:
- Limited to Kafka Connect-based services
- Doesn't support custom microservices
- Poor observability for debugging
- No schema validation tooling
Decision: Support Kafka Connect as one type of pipeline stage but don't limit to connectors.
Open Questions
- Multi-Cluster Pipelines: Should Prism support pipelines spanning multiple Kafka clusters?
  - How to correlate topics across clusters (identical topic names vs prefix)?
  - Cross-cluster latency monitoring and alerting?
  - MirrorMaker2 integration for topic replication?
- Pipeline Versioning: How should pipeline topology versions be managed?
  - Git-based versioning with declarative YAML?
  - Immutable pipeline versions stored in registry?
  - Rollback support: revert to previous topology version?
  - Diff between topology versions for change tracking?
- Access Control: How should pipeline-level permissions work?
  - Per-stage access control: restrict who can modify each stage?
  - Pipeline-wide RBAC: permissions for entire pipeline?
  - Integration with existing Prism namespace AuthZ?
  - Separate read/write permissions for pipeline metadata vs data access?
References
- ADR-031: Message Envelope Protocol
- RFC-031: Message Envelope Protocol
- RFC-037: Mailbox Pattern - Searchable Event Store
- RFC-046: Consolidated Pattern Protocols
- Confluent Schema Registry Documentation
- Kafka Streams Topology Documentation
- OpenTelemetry Semantic Conventions for Messaging
Changelog
- 2025-01-07: Add performance targets, schema evolution strategy, data lineage for ML, partition-aware replay, data quality validation, ML debugging features, cost planning, failure scenarios, multi-tenancy/security, adoption path
- 2025-11-07: Initial draft