RFC-053: Schematized Topic-Based Microservice Pipelines

Executive Summary

This RFC proposes support for schematized topic-based microservice pipelines, where data flows through a chain of Kafka topics with strongly-typed schemas at each stage. Prism will provide developer tooling to discover, debug, and optimize these workflows through:

  1. Pipeline Discovery & Visualization - Automatic topology detection from topic schemas and consumer group subscriptions
  2. Schema-Aware Observability - End-to-end tracing with schema validation at each pipeline stage
  3. Developer Productivity Tools - Local replay, diff tooling, and pipeline testing frameworks
  4. Production Debugging - Data lineage tracking, schema evolution management, and error root cause analysis

Problem Statement

Current Challenges with Topic-Based Pipelines

Event-driven microservice architectures rely on pipelines where:

  • Services process data sequentially through Kafka topics
  • Each stage reads from input topics, transforms data, writes to output topics
  • Schemas evolve independently across services
  • Debugging requires manual correlation across topics, consumer groups, and traces

Pain Points:

  1. Pipeline Visibility

    • No automatic discovery of topic relationships
    • Unclear which services consume from which topics
    • Documentation becomes stale
    • Topology difficult to understand
  2. Schema Management

    • Schema evolution breaks downstream consumers
    • No validation that output matches next stage input
    • Schema version compatibility unclear
    • Version mismatches cause runtime failures
  3. Debugging

    • Errors in stage N+1 caused by bad data from earlier stages
    • Manual message correlation across topics
    • No tooling to replay message flows
    • Production issues require extensive log analysis
  4. Developer Productivity

    • Local testing requires full infrastructure
    • No pre-deployment validation of downstream impact
    • Schema changes require manual team coordination
    • Integration testing requires entire pipeline
  5. Operations

    • Consumer lag at one stage cascades downstream
    • No unified pipeline health view
    • Bottleneck identification requires manual analysis
    • Inconsistent error handling across stages

Goals

Primary Goals

  1. Pipeline Discovery: Automatically detect and visualize topic-based pipelines from:

    • Kafka topic metadata and schema registry
    • Consumer group subscriptions and topic assignments
    • Producer/consumer configuration in Prism patterns
  2. Schema-Aware Tracing: Provide end-to-end observability with:

    • Message tracing across all pipeline stages
    • Automatic schema validation at each topic boundary
    • Schema evolution impact analysis
    • Data lineage tracking from source to sink
  3. Developer Tools: Enable productive local development with:

    • Pipeline replay from any stage with historical data
    • Schema diff tooling to identify breaking changes
    • Local pipeline testing with mock topics
    • Contract testing between pipeline stages
  4. Production Debugging: Simplify troubleshooting with:

    • Root cause analysis for data quality issues
    • Message flow visualization for specific correlation IDs
    • Schema mismatch detection and alerting
    • Pipeline health dashboards with stage-specific metrics

Non-Goals

  • Building a general-purpose workflow orchestration engine (use Airflow/Temporal instead)
  • Supporting non-Kafka messaging systems in initial version
  • Replacing existing schema registry implementations
  • Providing data transformation logic (Prism is infrastructure, not business logic)
  • Real-time stream processing with complex joins (use Kafka Streams/Flink)
  • Batch job orchestration (use Airflow for batch pipelines)
  • Schema migration tooling (use schema registry features)
  • Long-term data archival (use dedicated data lake solutions)

Proposed Solution

Architecture Overview

Prism will introduce a new Pipeline Pattern that extends existing Producer/Consumer patterns with:

┌─────────────────────────────────────────────────────────────────┐
│                     Prism Pipeline Registry                     │
│  - Topology Discovery                                           │
│  - Schema Evolution Tracking                                    │
│  - Pipeline Metadata Store                                      │
└────────────────────────────────┬────────────────────────────────┘
                                 │
             ┌───────────────────┼─────────────────┐
             │                   │                 │
   ┌─────────▼────────┐   ┌──────▼────────┐   ┌────▼──────────┐
   │     Producer     │   │   Transform   │   │   Consumer    │
   │     Pattern      │   │    Pattern    │   │    Pattern    │
   │  +Pipeline Ext   │   │   +Pipeline   │   │   +Pipeline   │
   └─────────┬────────┘   └──────┬────────┘   └────┬──────────┘
             │                   │                 │
        ┌────▼─────┐        ┌────▼───┐        ┌────▼───┐
        │ Topic A  │───────▶│Topic B │───────▶│Topic C │
        │ Schema 1 │        │Schema 2│        │Schema 3│
        └──────────┘        └────────┘        └────────┘

Key Components

1. Pipeline Discovery Service

Responsibilities:

  • Scan Kafka cluster for topics and consumer groups
  • Query schema registry for topic schemas
  • Correlate producers/consumers using Prism pattern metadata
  • Build directed acyclic graph (DAG) of topic relationships
  • Detect cycles and orphaned topics

Implementation:

message PipelineTopology {
  string pipeline_id = 1;
  repeated PipelineStage stages = 2;
  repeated TopicEdge edges = 3;
  map<string, SchemaVersion> schemas = 4;
  google.protobuf.Timestamp discovered_at = 5;
}

message PipelineStage {
  string stage_id = 1;
  string service_name = 2;
  repeated string input_topics = 3;
  repeated string output_topics = 4;
  string consumer_group = 5;
  PipelineStageMetadata metadata = 6;
}

message TopicEdge {
  string from_stage = 1;
  string to_stage = 2;
  string topic_name = 3;
  SchemaCompatibility compatibility = 4;
}

Discovery Algorithm:

  1. Enumerate topics with schema registry entries
  2. Identify subscribed topics per consumer group
  3. Link consumers to output topics via Prism pattern metadata
  4. Build directed graph: services as nodes, topics as edges
  5. Validate schema compatibility at each edge
  6. Detect cycles and orphaned topics
  7. Store topology in Pipeline Registry with timestamp
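
A minimal sketch of this pass, assuming hypothetical admin, registry, and pattern-metadata accessors (none of these are a committed Prism API) and using networkx for the graph:

# Sketch only: admin, registry, and the `stages` metadata shape are illustrative stand-ins.
import networkx as nx

def discover_topology(admin, registry, stages):
    """stages: declared Prism pattern metadata, one dict per stage."""
    graph = nx.DiGraph()

    # 1-2. Topics that have schema-registry entries, and subscriptions per consumer group.
    topics = {t for t in admin.list_topics() if registry.has_subject(f"{t}-value")}
    group_topics = admin.consumer_group_subscriptions()        # {group_id: {topic, ...}}
    group_to_stage = {s["consumer_group"]: s["stage_id"]
                      for s in stages if s.get("consumer_group")}

    # 3-4. Stages become nodes; a topic produced by one stage and consumed by
    #      another becomes a directed edge between the two stages.
    for s in stages:
        graph.add_node(s["stage_id"], service=s["service_name"])
    for s in stages:
        for topic in s.get("output_topics", []):
            for group, subscribed in group_topics.items():
                if topic in subscribed and group in group_to_stage:
                    graph.add_edge(s["stage_id"], group_to_stage[group], topic=topic)

    # 5. Check schema compatibility on every edge.
    for _, _, data in graph.edges(data=True):
        data["compatible"] = registry.is_compatible(f"{data['topic']}-value")

    # 6. Cycles and orphaned topics are reported rather than silently dropped.
    cycles = list(nx.simple_cycles(graph))
    used = {d["topic"] for _, _, d in graph.edges(data=True)}
    return graph, cycles, sorted(topics - used)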

2. Schema-Aware Tracing

Extend existing message envelope with pipeline context:

message PipelineContext {
  string pipeline_id = 1;
  string stage_id = 2;
  int32 stage_sequence = 3;  // Position in pipeline (0-based)
  string correlation_id = 4; // End-to-end trace ID
  repeated SchemaValidation validations = 5;
  map<string, string> stage_metadata = 6;
  DataLineage lineage = 7;
}

message SchemaValidation {
  string stage_id = 1;
  string schema_name = 2;
  int32 schema_version = 3;
  bool validation_passed = 4;
  repeated string errors = 5;
  google.protobuf.Timestamp validated_at = 6;
}

message DataLineage {
  repeated DataSource sources = 1;
  repeated Transformation transformations = 2;
  repeated ModelVersion models = 3;
  DataQualityMetrics quality = 4;
  google.protobuf.Timestamp expires_at = 5;
  repeated string pii_tags = 6;
}

message DataSource {
  string source_id = 1;
  string source_type = 2;
  string source_location = 3;
  google.protobuf.Timestamp ingested_at = 4;
  map<string, string> metadata = 5;
}

message Transformation {
  string transform_id = 1;
  string transform_type = 2;
  string code_version = 3;
  map<string, double> metrics = 4;
}

message ModelVersion {
  string model_id = 1;
  string model_version = 2;
  string framework = 3;
  map<string, double> inference_metrics = 4;
}

message DataQualityMetrics {
  int64 records_processed = 1;
  int64 records_failed = 2;
  map<string, double> quality_scores = 3;
  repeated string anomalies = 4;
}

Tracing Flow:

  1. Producer creates message with initial pipeline_context
  2. Each stage validates input schema before processing
  3. Validation results appended to pipeline_context.validations
  4. Transformation metrics recorded in pipeline_context.lineage
  5. OpenTelemetry spans link stages via correlation_id
  6. Prism proxy logs validation failures with stage context
  7. Failed messages routed to DLQ with full lineage
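
A sketch of what a stage-side interceptor could look like, built on the generated classes from the PipelineContext definition above; validate_against_schema(), route_to_dlq(), and the handler are placeholders, not existing Prism functions:

# Sketch only: generated protobuf classes come from the definitions above;
# validate_against_schema() and route_to_dlq() are placeholders.
from opentelemetry import trace

tracer = trace.get_tracer("prism.pipeline")

def process_with_tracing(ctx, payload, stage_id, schema, handler):
    """ctx: the PipelineContext carried in the message envelope."""
    errors = validate_against_schema(payload, schema)     # placeholder validation call

    # Steps 2-3: record the validation outcome on the context before processing.
    validation = ctx.validations.add()
    validation.stage_id = stage_id
    validation.schema_name = schema.name
    validation.schema_version = schema.version
    validation.validation_passed = not errors
    validation.errors.extend(errors)
    validation.validated_at.GetCurrentTime()

    # Step 7: failures go to the DLQ with the full context and lineage attached.
    if errors:
        route_to_dlq(payload, ctx)                        # placeholder DLQ producer
        return None

    # Step 5: spans share the end-to-end correlation_id so stages link into one trace.
    with tracer.start_as_current_span(
        stage_id, attributes={"correlation_id": ctx.correlation_id}
    ):
        return handler(payload)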

Benefits:

  • End-to-end message flow visibility
  • Automatic schema mismatch detection
  • Debugging context for production issues
  • Data lineage for compliance (GDPR, SOC2)

3. Developer Tooling

A. Pipeline CLI (prismctl pipeline)

# Discover and visualize pipelines
prismctl pipeline discover --cluster=prod

# Show topology for specific pipeline
prismctl pipeline show user-signup-flow --format=dot
prismctl pipeline show user-signup-flow --format=mermaid

# Validate schema compatibility
prismctl pipeline validate user-signup-flow

# Replay messages for testing with partition awareness
prismctl pipeline replay \
  --pipeline=user-signup-flow \
  --from-stage=email-validator \
  --correlation-id=abc123 \
  --partition=2 \
  --offset-range=1000-2000 \
  --preserve-order=true \
  --rate-limit=1000 \
  --target=local

# Replay with sampling for large datasets
prismctl pipeline replay \
  --pipeline=ml-feature-pipeline \
  --from-stage=feature-transformer \
  --time-range="2025-11-06T00:00:00Z/2025-11-07T00:00:00Z" \
  --sample-rate=0.1 \
  --target=local

# Diff schemas between stages
prismctl pipeline schema-diff \
  --pipeline=user-signup-flow \
  --from-stage=api-gateway \
  --to-stage=email-validator

# Sample messages for debugging
prismctl pipeline sample \
  --pipeline=ml-feature-pipeline \
  --stage=feature-transformer \
  --filter="$.features.click_rate > 0.8" \
  --count=100 \
  --output=debug_samples.jsonl

# Compare outputs between versions
prismctl pipeline diff \
  --pipeline=ml-feature-pipeline \
  --stage=feature-transformer \
  --baseline-version=v1.2.0 \
  --comparison-version=v1.3.0 \
  --correlation-ids=sample.txt \
  --show-distribution-stats

B. Local Development Environment

# .prism/pipelines/user-signup.yml
pipeline:
  id: user-signup-flow
  stages:
    - id: api-gateway
      service: api-service
      output_topics:
        - name: raw-signups
          schema: SignupRequest

    - id: email-validator
      service: validator-service
      input_topics:
        - name: raw-signups
          schema: SignupRequest
      output_topics:
        - name: validated-signups
          schema: ValidatedSignup

    - id: user-creator
      service: user-service
      input_topics:
        - name: validated-signups
          schema: ValidatedSignup
      output_topics:
        - name: created-users
          schema: User

# Start local pipeline with mocked topics
prismctl pipeline dev user-signup-flow

C. Contract Testing Framework

# tests/test_pipeline_contracts.py
from prism_data.testing import PipelineContractTest

class TestUserSignupPipeline(PipelineContractTest):
    pipeline_id = "user-signup-flow"

    def test_email_validator_input_contract(self):
        """Ensure email-validator can consume from api-gateway output."""
        self.assert_schema_compatible(
            producer_stage="api-gateway",
            consumer_stage="email-validator",
            topic="raw-signups"
        )

    def test_end_to_end_message_flow(self):
        """Test complete pipeline with sample message."""
        result = self.send_message(
            stage="api-gateway",
            message={"email": "test@example.com", "name": "Test User"}
        )

        self.assert_message_reaches_stage(
            correlation_id=result.correlation_id,
            stage="user-creator",
            timeout_seconds=10
        )

        self.assert_schema_validations_passed(
            correlation_id=result.correlation_id
        )

class TestMLFeaturePipeline(PipelineContractTest):
    pipeline_id = "ml-feature-pipeline"

    def test_feature_transformer_data_quality(self):
        """Ensure feature transformer output meets quality thresholds."""
        result = self.send_message(
            stage="raw-data-ingestion",
            message={"user_id": 123, "clicks": 5, "timestamp": "2025-11-07"}
        )

        output = self.assert_message_reaches_stage(
            correlation_id=result.correlation_id,
            stage="feature-transformer",
            timeout_seconds=10
        )

        self.assert_data_quality(
            output,
            rules={
                "completeness": {"threshold": 0.95},
                "range_checks": {
                    "click_rate": {"min": 0.0, "max": 1.0},
                    "feature_count": {"min": 10, "max": 100}
                },
                "no_nulls": ["user_id", "feature_vector"],
                "distribution": {
                    "feature_vector": {"mean_range": [0, 10], "std_range": [0, 5]}
                }
            }
        )

    def test_schema_drift_detection(self):
        """Alert on schema drift in feature pipeline."""
        self.assert_no_schema_drift(
            stage="feature-transformer",
            baseline_time_range="2025-10-01/2025-10-31",
            comparison_time_range="2025-11-01/2025-11-07",
            max_drift_score=0.1
        )

4. Production Debugging

A. Pipeline Health Dashboard

Grafana dashboard with:

  • Pipeline topology visualization (auto-generated from discovery)
  • Per-stage metrics: throughput, latency, error rate
  • Consumer lag across all stages
  • Schema validation failure rate
  • Correlation ID search and trace viewer

B. Root Cause Analysis

When errors occur at stage N, Prism automatically:

  1. Identifies messages that failed validation
  2. Traces correlation ID back through pipeline
  3. Highlights stage where data corruption occurred
  4. Shows schema diff between expected and actual
  5. Links to relevant service logs and traces
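
A minimal sketch of the stage-identification step, reading the SchemaValidation entries defined earlier; fetch_context() and schema_diff() stand in for trace-store and registry lookups that are not specified here:

# Sketch only: fetch_context() and schema_diff() are placeholders.
def find_root_cause(correlation_id):
    """Return the earliest stage whose schema validation failed, if any."""
    ctx = fetch_context(correlation_id)            # placeholder: trace-store lookup
    for v in sorted(ctx.validations, key=lambda v: v.validated_at.ToDatetime()):
        if not v.validation_passed:
            return {
                "stage": v.stage_id,
                "schema": f"{v.schema_name} v{v.schema_version}",
                "errors": list(v.errors),
                "schema_diff": schema_diff(v.schema_name, v.schema_version),  # placeholder
            }
    return None  # no recorded validation failure; fall back to service logs and traces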

Example Scenario:

Stage 3 (user-creator) failing with schema validation errors:

┌─────────────────────────────────────────────────────────┐
│ Root Cause: email-validator (Stage 2)                   │
│                                                         │
│ Expected output schema: ValidatedSignup v2              │
│ Actual output schema:   ValidatedSignup v1              │
│                                                         │
│ Missing field: phone_number (required in v2)            │
│                                                         │
│ Action: Upgrade email-validator to produce v2 schema    │
│ Or: Downgrade user-creator to accept v1 schema          │
└─────────────────────────────────────────────────────────┘

C. ML-Specific Debugging

Feature Drift Detection:

  • Statistical tests (Kolmogorov-Smirnov, chi-squared) on feature distributions
  • Alert when distributions deviate beyond configurable threshold
  • Time-series visualization of feature statistics
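
For numeric features, the drift check can be as small as a two-sample KS test; this sketch assumes feature values have already been sampled from the baseline and comparison windows:

from scipy.stats import ks_2samp

def numeric_feature_drifted(baseline_values, current_values, max_drift_score=0.1):
    """Two-sample Kolmogorov-Smirnov test; the statistic doubles as the drift score."""
    statistic, p_value = ks_2samp(baseline_values, current_values)
    return statistic > max_drift_score, statistic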

Model Performance Tracking:

  • Track inference latency, throughput, prediction distributions per stage
  • Correlate model version with performance metrics
  • Automatic rollback trigger on regression beyond threshold
  • Record model metadata in pipeline context lineage

Data Sampling for Debug:

  • JSONPath filters for targeted message sampling
  • Stratified sampling for representative debugging datasets
  • Export to JSONL for offline analysis

Integration with Existing Patterns

Backward Compatibility:

  • Existing Producer/Consumer patterns work unchanged
  • Pipeline features opt-in via pattern configuration
  • No breaking changes to protobuf message definitions

Pattern Extensions:

# Producer pattern with pipeline support
pattern: producer
pipeline:
  enabled: true
  pipeline_id: user-signup-flow
  stage_id: api-gateway
  output_topics:
    - name: raw-signups
      schema:
        registry: confluent
        subject: raw-signups-value
        version: latest

# Consumer pattern with pipeline support
pattern: consumer
pipeline:
  enabled: true
  pipeline_id: user-signup-flow
  stage_id: email-validator
  input_topics:
    - name: raw-signups
      schema:
        registry: confluent
        subject: raw-signups-value
        version: 2
  output_topics:
    - name: validated-signups
      schema:
        registry: confluent
        subject: validated-signups-value
        version: 1

Implementation Plan

Phase 1: Foundation (Weeks 1-4)

Deliverables:

  • Pipeline topology protobuf definitions
  • Pipeline Registry service (basic CRUD)
    • Dependency: Choose storage backend (SQLite vs Postgres - ADR needed)
    • Risk: Schema evolution for registry itself
  • Discovery service for Kafka cluster scanning
    • Dependency: Kafka Admin API rate limits - test at scale first
    • Risk: Large clusters (>5000 topics) may timeout
  • Schema compatibility validator
    • Dependency: Integration with Confluent Schema Registry AND AWS Glue
    • Risk: Different registries have incompatible APIs
  • prismctl pipeline discover command
  • Load test discovery with synthetic topics

Success Criteria:

  • Can scan production Kafka cluster with throttling
  • Generate pipeline topology graph in <30s for 1000 topics
  • Detect schema incompatibilities with 100% accuracy vs manual verification
  • Load test passes with 10,000 synthetic topics

Phase 2: Developer Tooling (Weeks 5-8)

Deliverables:

  • Pipeline CLI commands (show, validate, schema-diff, replay, sample)
  • Local development environment support
  • Pipeline visualization (DOT/Mermaid export)
  • Contract testing framework with data quality validation
  • Partition-aware replay with sampling support

Success Criteria:

  • Developers can visualize and replay pipelines locally
  • Schema diffs identify breaking changes before deployment
  • Contract tests validate schema compatibility and data quality
  • Replay supports partition targeting and rate limiting

Phase 3: Tracing & Observability (Weeks 9-12)

Deliverables:

  • Pipeline context in message envelope with data lineage
  • Schema validation at each stage
  • OpenTelemetry integration with pipeline spans
  • Grafana dashboard templates with topology visualization
  • Correlation ID search with trace aggregation
  • ML-specific metrics (feature drift, model performance)

Success Criteria:

  • End-to-end trace spans link all pipeline stages
  • Schema validation failures logged and alerted with root cause
  • Dashboards show pipeline health, consumer lag, validation rates
  • Data lineage tracked for ML compliance

Phase 4: Production Debugging (Weeks 13-16)

Deliverables:

  • Root cause analysis with automatic stage identification
  • Message replay with partition and offset control
  • Schema evolution impact reports with migration plans
  • Pipeline health alerting rules (lag, validation failures, drift)
  • Feature drift detection for ML pipelines
  • Dead-letter queue integration

Success Criteria:

  • Reduce MTTR for pipeline issues
  • Identify root cause stage for schema errors automatically
  • Support partition-aware replay with sampling
  • Detect and alert on feature drift for ML workloads

Success Metrics

Developer Productivity

  • Onboarding Time: Time for new developers to understand pipeline topology
  • Debug Time: Time to identify root cause of pipeline failures
  • Test Coverage: Percentage of pipeline stages covered by contract tests

Operational Excellence

  • MTTR: Mean time to resolution for pipeline issues
  • Schema Errors: Production schema validation failures caught pre-deployment
  • Pipeline Visibility: Production pipelines automatically discovered and visualized

Performance & Scale

  • Discovery Throughput: Support clusters with 10,000+ topics without degradation
  • Discovery Latency: Complete topology scan in <30s for 1,000 topics
  • Trace Overhead: Pipeline context adds <5% message size overhead
  • Replay Performance: Support replay at 10,000 msg/s for debugging
  • Memory Footprint: Pipeline registry uses <1GB RAM for 100 active pipelines

Schema Evolution Strategy

Prism enforces schema compatibility levels per topic to prevent breaking changes.

Compatibility Levels

BACKWARD (default): New consumers read old data

  • Add optional fields: allowed
  • Remove optional fields: allowed
  • Add required fields: forbidden
  • Remove required fields: forbidden

FORWARD: Old consumers read new data

  • Add optional fields: allowed
  • Remove optional fields: allowed
  • Add required fields: forbidden
  • Remove required fields: forbidden

FULL: Backward + Forward compatible

  • Add optional fields: allowed
  • All other changes: forbidden
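
As an illustration only, the rule table above reduces to a small check over flattened field descriptors; real validation would defer to the registry's own compatibility API and also consider types and defaults:

# Sketch only: fields are flattened to {"name": str, "required": bool} dicts.
def is_compatible(old_fields, new_fields, level="BACKWARD"):
    old = {f["name"]: f for f in old_fields}
    new = {f["name"]: f for f in new_fields}
    added = [f for name, f in new.items() if name not in old]
    removed = [f for name, f in old.items() if name not in new]

    if level in ("BACKWARD", "FORWARD"):
        # Adding or removing required fields is forbidden; optional-field changes pass.
        return not any(f["required"] for f in added + removed)
    if level == "FULL":
        # Only adding optional fields is allowed.
        return not removed and not any(f["required"] for f in added)
    raise ValueError(f"unknown compatibility level: {level}")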

Breaking Change Workflow

When pre-deployment validation detects incompatibility:

  1. CI/CD fails with schema diff showing incompatible changes
  2. Developer chooses migration strategy:
    • Dual-write: Write to both topic and topic-v2 during transition
    • Blue-green: Create topic-v2, migrate consumers incrementally
    • Consumer upgrade first: Upgrade all consumers to support both schemas
  3. Execute migration: prismctl pipeline migrate --strategy=dual-write
  4. Monitor consumer lag and error rates on both topics
  5. Validate all consumers migrated successfully
  6. Deprecate old topic after configured transition period (default: 7 days)
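
A sketch of the dual-write option from step 2, using confluent_kafka; the -v2 topic name and translate_to_v2() helper are illustrative, not part of the migration tooling itself:

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def dual_write(payload_v1, translate_to_v2):
    # Old consumers keep reading the existing topic; migrated consumers read the -v2 topic.
    producer.produce("validated-signups", value=payload_v1)
    producer.produce("validated-signups-v2", value=translate_to_v2(payload_v1))
    producer.flush()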

Implementation

  • Add schema_compatibility_level to PipelineStage protobuf
  • Validate compatibility during prismctl pipeline validate
  • Block deployments that violate compatibility rules
  • Generate migration plan with rollback steps

Failure Scenarios & Recovery

Scenario 1: Schema Registry Unavailable

Impact: Cannot validate schemas, discovery fails

Detection: Health check fails for schema registry client

Recovery:

  1. Fall back to cached schemas (5-minute TTL)
  2. Alert on-call engineer
  3. Gracefully degrade: Allow pipeline to proceed with warning
  4. Log all unvalidated messages for post-incident validation
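
A sketch of the cached-schema fallback in step 1, using cachetools for the 5-minute TTL; the registry client call is a placeholder:

import logging
from cachetools import TTLCache

log = logging.getLogger("prism.pipeline")
schema_cache = TTLCache(maxsize=1024, ttl=300)    # 5-minute TTL

def get_schema(registry, subject):
    """Returns (schema, validated_live); degrades to the cache when the registry is down."""
    try:
        schema = registry.get_latest(subject)     # placeholder registry call
        schema_cache[subject] = schema
        return schema, True
    except Exception:
        if subject in schema_cache:
            log.warning("schema registry unavailable; using cached schema for %s", subject)
            return schema_cache[subject], False   # messages flagged for post-incident validation
        raise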

Scenario 2: Consumer Lag Exceeds Threshold

Impact: Downstream stages starved, end-to-end latency increases

Detection: Consumer group lag exceeds configured threshold

Recovery:

  1. prismctl pipeline diagnose identifies bottleneck stage
  2. Auto-scale consumer instances if enabled
  3. Trigger backpressure to slow producers
  4. Alert if lag persists beyond threshold

Scenario 3: Schema Validation Failure Rate Exceeds Threshold

Impact: Data quality issues, downstream failures

Detection: Validation failure rate metric crosses threshold

Recovery:

  1. Identify failing stage via dashboard
  2. Automatically pause consumer at failing stage
  3. Quarantine failed messages to dead-letter queue (DLQ)
  4. Root cause analysis links to recent deployments
  5. Automated rollback if correlated with recent schema change

Scenario 4: Pipeline Topology Drift

Impact: Discovered topology doesn't match declared configuration

Detection: Topology diff between registry and actual cluster state

Recovery:

  1. Alert indicates drift detected
  2. prismctl pipeline reconcile shows differences
  3. Admin chooses: accept drift (update registry) or fix cluster
  4. Audit log tracks all topology changes

Multi-Tenancy & Security

Namespace Isolation

Pipelines scoped to Prism namespaces:

  • Prefix all topic names with namespace: {namespace}.{topic_name}
  • Pipeline CLI respects namespace permissions
  • Cross-namespace pipelines require explicit authorization

Sensitive Data Handling

PII Detection: Automatically tag PII fields in schemas

Data Masking: Support redaction in replay and sampling

prismctl pipeline replay \
  --pipeline=user-signup-flow \
  --mask-fields="email,phone,ssn" \
  --target=local

Audit Logging

  • Log all pipeline topology changes with who/what/when
  • Track message replay operations for forensics
  • Integrate with compliance reporting (SOC2, GDPR)

Encryption

  • Schema registry credentials encrypted at rest in Pipeline Registry
  • Kafka credentials managed via Vault integration (ADR-028)
  • TLS required for all inter-service communication

Cost & Resource Planning

Storage Costs

  • Pipeline Registry: ~10MB per pipeline topology (protobuf compressed)
  • Trace Data: With 1M msgs/day and 7-day retention:
    • Uncompressed: ~100 bytes of pipeline context per message ≈ 700MB
    • Compressed (zstd): ~140MB (80% reduction)
    • Sampling strategy: 100% of validation failures, 1% of successes ≈ 7MB before compression
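
The trace-data figures above follow directly from the stated inputs:

# Worked arithmetic behind the trace-data estimate.
msgs_per_day = 1_000_000
bytes_per_msg = 100           # pipeline-context overhead per message
retention_days = 7

uncompressed = msgs_per_day * bytes_per_msg * retention_days   # 700,000,000 bytes ≈ 700MB
compressed = uncompressed * 0.20                               # zstd at ~80% reduction ≈ 140MB
sampled = uncompressed * 0.01                                  # ~1% retained ≈ 7MB before compression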

Compute Costs

  • Discovery Service: 1 CPU core, 512MB RAM (continuous scanning)
  • Schema Validation: ~0.1-1ms CPU per message depending on schema complexity
  • Replay Service: Scale linearly with replay throughput (provision on-demand)
  • Estimated compute: $30-50/month for 100 pipelines (t3.medium + Lambda)

Network Costs

  • Admin API Calls: Discovery scans = ~1000 calls/5min = 288k calls/day
    • Mitigation: Cache topology (5-minute TTL), incremental scanning
  • Cross-AZ Traffic: Pipeline spanning AZs = $0.01/GB egress
    • Mitigation: Same-AZ routing, replica placement awareness
  • Estimated network: $5-20/month depending on cluster topology

Adoption Path & Migration

For New Pipelines

  1. Define pipeline in .prism/pipelines/*.yml
  2. Add pipeline section to Producer/Consumer patterns
  3. Run prismctl pipeline validate in CI/CD
  4. Deploy with automatic topology registration

For Existing Pipelines

  1. Discovery Phase: Run prismctl pipeline discover to map existing topology
  2. Annotation Phase: Add pipeline metadata to patterns incrementally
  3. Validation Phase: Enable schema validation (warning mode first)
  4. Enforcement Phase: Block deployments on validation failures

Pilot Project Recommendations

  • Start with: Non-critical pipeline with 3-5 stages, <10k msg/s
  • Avoid: High-throughput (>100k msg/s) or mission-critical pipelines initially
  • Measure:
    • MTTR for debugging pipeline issues
    • Time to onboard new developer to pipeline
    • Number of schema-related production incidents
    • Developer satisfaction survey
  • Timeline: 2-week pilot, 2-week retrospective with team feedback
  • Success criteria: Developer feedback positive, no production incidents from pipeline features

Risks & Mitigations

Risk 1: Discovery Performance Impact

Risk: Scanning large clusters (5000+ topics) may impact cluster performance or timeout.

Mitigation:

  • Use Kafka Admin API with rate limiting (10 requests/second)
  • Cache topology with configurable TTL (default: 5 minutes)
  • Incremental discovery: only scan topics modified since last scan
  • Manual topology registration as fallback for large clusters
  • Offload discovery to dedicated Kafka cluster replica if available

Risk 2: Schema Registry Dependency

Risk: Tight coupling to specific schema registry implementations (Confluent, AWS Glue).

Mitigation:

  • Abstract schema registry behind Prism interface
  • Support multiple registry providers via plugins
  • Allow manual schema definition as fallback
  • Maintain local schema cache for offline development

Risk 3: Message Overhead

Risk: Pipeline context adds 100-500 bytes per message (overhead varies with lineage depth).

Mitigation:

  • Make pipeline context optional (opt-in per pattern)
  • Use efficient protobuf encoding
  • Compress validation histories when >10 validations
  • Support Kafka header-based context (not in message body) to reduce payload size
  • Prune old validation entries after configurable stage count (default: 10 stages)
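
A sketch of the header-based option using confluent_kafka; the header key is illustrative, and the context is the serialized PipelineContext from earlier:

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def produce_with_header_context(topic, value, pipeline_context):
    # Context rides in a Kafka record header, leaving the payload bytes unchanged.
    producer.produce(
        topic,
        value=value,
        headers=[("prism-pipeline-context", pipeline_context.SerializeToString())],
    )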

Risk 4: Adoption Resistance

Risk: Teams resist adding pipeline metadata to existing services.

Mitigation:

  • Demonstrate value with pilot project (quantify MTTR reduction)
  • Provide automated migration tooling for existing pipelines
  • Support gradual rollout: works with partial pipeline metadata
  • Discovery works without manual metadata (infers from Kafka consumer groups)
  • Create clear documentation with real-world examples
  • Provide optional features: teams enable schema validation when ready

Alternatives Considered

Alternative 1: Use Existing Tools (Kafka Streams, ksqlDB)

Pros:

  • Mature ecosystem with wide adoption
  • Built-in topology management
  • Strong community support

Cons:

  • Requires rewriting services in Kafka Streams
  • Limited observability for custom microservices
  • No schema-aware debugging tooling
  • Doesn't integrate with existing Prism patterns

Decision: Build on Prism to leverage existing patterns and provide deeper integration.

Alternative 2: Schema Registry as Single Source of Truth

Pros:

  • Schema registry already tracks topic schemas
  • No additional metadata storage needed

Cons:

  • Schema registry doesn't track consumer relationships
  • Can't represent pipeline topology
  • Limited to schema-level metadata
  • Doesn't support custom pipeline stages

Decision: Use schema registry as data source but maintain pipeline topology separately.

Alternative 3: Kafka Connector-Based Approach

Pros:

  • Kafka Connect has built-in topology concepts
  • Integrates with existing connector ecosystem

Cons:

  • Limited to Kafka Connect-based services
  • Doesn't support custom microservices
  • Poor observability for debugging
  • No schema validation tooling

Decision: Support Kafka Connect as one type of pipeline stage but don't limit to connectors.

Open Questions

  1. Multi-Cluster Pipelines: Should Prism support pipelines spanning multiple Kafka clusters?

    • How to correlate topics across clusters (identical topic names vs prefix)?
    • Cross-cluster latency monitoring and alerting?
    • MirrorMaker2 integration for topic replication?
  2. Pipeline Versioning: How should pipeline topology versions be managed?

    • Git-based versioning with declarative YAML?
    • Immutable pipeline versions stored in registry?
    • Rollback support: revert to previous topology version?
    • Diff between topology versions for change tracking?
  3. Access Control: How should pipeline-level permissions work?

    • Per-stage access control: restrict who can modify each stage?
    • Pipeline-wide RBAC: permissions for entire pipeline?
    • Integration with existing Prism namespace AuthZ?
    • Separate read/write permissions for pipeline metadata vs data access?

References

Changelog

  • 2025-01-07: Add performance targets, schema evolution strategy, data lineage for ML, partition-aware replay, data quality validation, ML debugging features, cost planning, failure scenarios, multi-tenancy/security, adoption path
  • 2025-11-07: Initial draft