Kubernetes Operator with Custom Resource Definitions
Context
Managing Prism deployments at scale requires automation for:
- Namespace Lifecycle: Creating, updating, deleting namespaces across multiple Prism instances
- Shard Management: Deploying product/feature-based shards (ADR-034)
- Plugin Installation: Distributing plugins across instances
- Configuration Sync: Keeping namespace configs consistent across replicas
- Resource Management: CPU/memory limits, autoscaling, health checks
Manual Management Pain Points
Without automation:
- YAML Hell: Manually maintaining hundreds of namespace config files
- Deployment Complexity: kubectl apply across multiple files, error-prone
- Inconsistency: Config drift between Prism instances
- No GitOps: Can't declaratively manage Prism infrastructure as code
- Slow Iteration: Namespace changes require manual updates to multiple instances
Kubernetes Operator Pattern
Operators extend Kubernetes with custom logic to manage applications:
- CRDs (Custom Resource Definitions): Define custom resources (e.g., `PrismNamespace`)
- Controller: Watches CRDs, reconciles desired state → actual state
- Declarative: Describe what you want, operator figures out how
Examples: PostgreSQL Operator, Kafka Operator, Istio Operator
Decision
Build a Prism Kubernetes Operator that manages Prism deployments via Custom Resource Definitions (CRDs).
Status: ✅ IMPLEMENTED - The operator is fully implemented with auto-scaling support via HPA and KEDA. See prism-operator/ directory.
Implemented Custom Resources
The operator currently implements two primary CRDs:
1. PrismPattern (Implemented)
Individual pattern runner with auto-scaling support:
```yaml
apiVersion: prism.io/v1alpha1
kind: PrismPattern
metadata:
  name: consumer-kafka-orders
spec:
  pattern: consumer
  backend: kafka
  image: ghcr.io/prism/consumer-runner:latest
  replicas: 2
  resources:
    requests:
      cpu: "500m"
      memory: "1Gi"
    limits:
      cpu: "2000m"
      memory: "4Gi"
  service:
    type: ClusterIP
    port: 8080
  # HPA or KEDA auto-scaling
  autoscaling:
    enabled: true
    scaler: keda  # or "hpa"
    minReplicas: 2
    maxReplicas: 50
    pollingInterval: 10
    cooldownPeriod: 300
    triggers:
      - type: kafka
        metadata:
          bootstrapServers: "kafka:9092"
          consumerGroup: "prism-orders"
          topic: "orders"
          lagThreshold: "1000"
  placement:
    nodeSelector:
      workload-type: compute-intensive
status:
  created: "2025-10-10"
  updated: "2025-11-15"
  phase: Running  # Pending → Progressing → Running
  replicas: 5
  availableReplicas: 5
  conditions:
    - type: Ready
      status: "True"
```
2. PrismStack (Partially Implemented)
Complete Prism deployment including proxy, admin, patterns, and backends:
```yaml
apiVersion: prism.io/v1alpha1
kind: PrismStack
metadata:
  name: production-stack
spec:
  proxy:
    image: ghcr.io/prism/prism-proxy:latest
    replicas: 5
    port: 8980
    autoscaling:
      enabled: true
      scaler: hpa
      minReplicas: 5
      maxReplicas: 20
      targetCPUUtilizationPercentage: 75
  admin:
    enabled: true
    port: 8981
    replicas: 3
    leaderElection:
      enabled: true
  patterns:
    - name: orders-consumer
      type: consumer
      backend: kafka
      replicas: 10
  backends:
    - name: kafka-prod
      type: kafka
      connectionString: "kafka.prod.svc:9092"
status:
  created: "2025-10-10"
  updated: "2025-11-15"
  phase: Running
  conditions:
    - type: Ready
      status: "True"
```
Note: The PrismStack CRD is defined, but its controller is not yet fully implemented. Current focus is PrismPattern, which is production-ready.
Future CRDs (Planned)
The following CRDs are planned for future implementation:
- PrismShard: Product/feature-based sharding (from ADR-034)
- PrismPlugin: Plugin lifecycle management
- PrismNamespace: Multi-tenant namespace provisioning
- PrismBackend: Backend connection configuration
Operator Architecture (Implemented)
```text
┌──────────────────────────────────────────────────────────────┐
│ Kubernetes Cluster                                           │
│                                                              │
│  ┌────────────────────────────────────────────────────────┐  │
│  │ Prism Operator (Controller)                            │  │
│  │                                                        │  │
│  │ Watches:                                               │  │
│  │  - PrismPattern CRDs                                   │  │
│  │  - PrismStack CRDs (partial)                           │  │
│  │                                                        │  │
│  │ Reconciles:                                            │  │
│  │  1. Creates/updates Deployment                         │  │
│  │  2. Creates/updates Service                            │  │
│  │  3. Creates HPA (CPU/memory) or KEDA ScaledObject      │  │
│  │  4. Updates status with phase and conditions           │  │
│  └────────────────────────────────────────────────────────┘  │
│                            │                                 │
│                            ▼                                 │
│  ┌──────────────────────────────────────────────────────┐    │
│  │ PrismPattern: consumer-kafka-orders                  │    │
│  │                                                      │    │
│  │  Deployment                  Service                 │    │
│  │  ┌─────────┐                ┌────────┐               │    │
│  │  │ Pod 1   │                │ :8080  │               │    │
│  │  │ Pod 2   │────────────────│        │               │    │
│  │  │ ...     │                └────────┘               │    │
│  │  │ Pod N   │                                         │    │
│  │  └─────────┘                                         │    │
│  │       │                                              │    │
│  │       └─── Scaled by: HPA or KEDA ScaledObject       │    │
│  │                                                      │    │
│  │  Auto-Scaling Triggers:                              │    │
│  │   - CPU/Memory (HPA)                                 │    │
│  │   - Kafka lag (KEDA)                                 │    │
│  │   - NATS queue depth (KEDA)                          │    │
│  │   - SQS queue length (KEDA)                          │    │
│  │   - 60+ other KEDA scalers                           │    │
│  └──────────────────────────────────────────────────────┘    │
└──────────────────────────────────────────────────────────────┘
```
Reconciliation Logic (Implemented)
When a PrismPattern is created or updated:
1. Reconcile Deployment:
   - Create or update the Deployment with the specified image, replicas, and resources
   - Apply placement constraints (nodeSelector, affinity, tolerations)
   - Set an owner reference for garbage collection
2. Reconcile Service (if `spec.service` is defined):
   - Create or update the Service with the specified type and port
   - Add labels for pattern identification
3. Reconcile Auto-Scaling (if `spec.autoscaling.enabled == true`):
   - HPA mode (`scaler: hpa`): create a HorizontalPodAutoscaler with CPU/memory targets and apply scaling behavior policies if specified
   - KEDA mode (`scaler: keda`): create a KEDA ScaledObject with triggers (Kafka, NATS, SQS, etc.), configure the polling interval and cooldown period, and support multiple triggers simultaneously
4. Update Status:
   - Set the phase: Pending → Progressing → Running
   - Update replica counts from the Deployment
   - Add a Ready condition with a message
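The phase progression in step 4 can be sketched as a pure function over the owned Deployment's replica counts. This is an illustrative helper, not the operator's actual code; the real controller also tracks conditions:

```go
package main

import "fmt"

// computePhase derives a PrismPattern phase from the owned Deployment's
// state. Hypothetical sketch of the status logic described above:
// no Deployment yet → Pending, pods still coming up → Progressing,
// all replicas available → Running.
func computePhase(deploymentExists bool, desired, available int32) string {
	switch {
	case !deploymentExists:
		return "Pending"
	case available < desired:
		return "Progressing"
	default:
		return "Running"
	}
}

func main() {
	fmt.Println(computePhase(false, 0, 0)) // Pending
	fmt.Println(computePhase(true, 5, 3))  // Progressing
	fmt.Println(computePhase(true, 5, 5))  // Running
}
```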
Supported KEDA Scalers:
- Kafka (consumer lag)
- NATS JetStream (pending messages)
- AWS SQS (queue depth)
- RabbitMQ (queue length)
- Redis (list/stream length)
- PostgreSQL (custom queries)
- 60+ more scalers
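For the Kafka trigger in the PrismPattern example earlier, the generated ScaledObject would look roughly like this (a sketch based on KEDA's standard `keda.sh/v1alpha1` schema; the exact names the operator emits may differ):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: consumer-kafka-orders
spec:
  scaleTargetRef:
    name: consumer-kafka-orders  # the Deployment owned by the PrismPattern
  pollingInterval: 10            # from spec.autoscaling.pollingInterval
  cooldownPeriod: 300            # from spec.autoscaling.cooldownPeriod
  minReplicaCount: 2
  maxReplicaCount: 50
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: "kafka:9092"
        consumerGroup: "prism-orders"
        topic: "orders"
        lagThreshold: "1000"
```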
See prism-operator/README.md for complete documentation.
Rationale
Why Custom Operator vs Raw Kubernetes?
Without Operator (raw Kubernetes manifests):
```yaml
# Must manually define:
- Deployment for each shard
- StatefulSet for SQLite persistence
- Services for each shard
- ConfigMaps for namespace configs (must sync manually!)
- Plugin sidecar injection (manual, error-prone)
```
With Operator:
```yaml
# Just define:
apiVersion: prism.io/v1alpha1
kind: PrismNamespace
metadata:
  name: my-namespace
spec:
  backend: postgres
  pattern: keyvalue
# Operator handles the rest!
```
Compared to Alternatives
vs Helm Charts:
- ✅ Operator is dynamic (watches for changes, reconciles)
- ✅ Operator can query Prism API for current state
- ❌ Helm is static (install/upgrade only)
- Use both: Operator installed via Helm, then manages CRDs
vs Manual kubectl:
- ✅ Operator enforces best practices
- ✅ Operator handles complex workflows (rolling updates, health checks)
- ❌ kubectl requires manual orchestration
- Operator wins for production deployments
vs External Tool (Ansible, Terraform):
- ✅ Operator is Kubernetes-native (no external dependencies)
- ✅ Operator continuously reconciles (self-healing)
- ❌ External tools are one-shot (no continuous reconciliation)
- Operator preferred for Kubernetes environments
Alternatives Considered
1. Helm Charts Only
- Pros: Simpler, no custom code
- Cons: No dynamic reconciliation, can't query Prism state
- Rejected because: Doesn't scale operationally (manual config sync)
2. GitOps (ArgoCD/Flux) Without Operator
- Pros: Declarative, Git as source of truth
- Cons: Still need to manage low-level Kubernetes resources manually
- Partially accepted: Use GitOps + Operator (ArgoCD applies CRDs, operator reconciles)
3. Serverless Functions (AWS Lambda, CloudRun)
- Pros: No Kubernetes needed
- Cons: Stateful config management harder, no standard API
- Rejected because: Prism is Kubernetes-native, operator pattern is standard
Consequences
Positive
- Declarative Management: `kubectl apply -f namespace.yaml` creates the namespace across all shards
- GitOps Ready: CRDs in Git → ArgoCD applies → Operator reconciles
- Self-Healing: Operator detects drift and corrects it
- Reduced Ops Burden: No manual config sync, deployment orchestration
- Type Safety: CRDs are schema-validated by Kubernetes API server
- Extensibility: Easy to add new CRDs (e.g., `PrismMigration` for shadow traffic automation)
Negative
- Operator Complexity: Must maintain operator code (Rust + kube-rs or Go + controller-runtime)
- Kubernetes Dependency: Prism is now tightly coupled to Kubernetes (but can still run standalone)
- Learning Curve: Operators require understanding of reconciliation loops, watches, caching
Neutral
- CRD Versioning: Must handle API versioning (v1alpha1 → v1beta1 → v1) over time
- RBAC: Operator needs permissions to create/update Deployments, Services, etc.
- Observability: Operator needs its own metrics, logging, tracing
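The RBAC point above translates into rules along these lines (an abridged sketch; the authoritative manifests live in `config/rbac/`):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prism-operator-role
rules:
  - apiGroups: ["prism.io"]
    resources: ["prismpatterns", "prismpatterns/status"]
    verbs: ["get", "list", "watch", "update", "patch"]
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: [""]
    resources: ["services"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["autoscaling"]
    resources: ["horizontalpodautoscalers"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["keda.sh"]
    resources: ["scaledobjects"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
```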
Implementation Details
Technology Stack (Implemented)
Language: Go with kubebuilder/controller-runtime
- Chosen for mature Kubernetes ecosystem and extensive examples
- Built using kubebuilder scaffolding
- Comprehensive testing with envtest
Actual Project Structure
```text
prism-operator/
├── Makefile                         # Build, test, deploy targets
├── go.mod                           # Go dependencies
├── cmd/
│   └── manager/
│       └── main.go                  # Operator entry point
├── api/
│   └── v1alpha1/
│       ├── prismpattern_types.go    # ✅ Implemented
│       ├── prismstack_types.go      # Partially implemented
│       └── groupversion_info.go
├── controllers/
│   └── prismpattern_controller.go   # ✅ Fully implemented
├── pkg/
│   └── autoscaling/
│       ├── hpa.go                   # HPA reconciliation
│       └── keda.go                  # KEDA ScaledObject reconciliation
├── config/
│   ├── crd/                         # Generated CRD manifests
│   │   └── bases/
│   │       └── prism.io_prismpatterns.yaml
│   ├── rbac/                        # RBAC manifests
│   ├── manager/                     # Operator deployment
│   └── samples/                     # Example patterns
│       ├── prismpattern_hpa_example.yaml
│       ├── prismpattern_keda_kafka_example.yaml
│       └── prismpattern_keda_multi_trigger_example.yaml
├── scripts/
│   └── install-keda.sh              # KEDA installation helper
├── README.md                        # Complete operator documentation
├── QUICK_START.md                   # 5-minute quickstart guide
├── KEDA_INSTALL_GUIDE.md            # KEDA installation details
└── TEST_REPORT.md                   # Test results and validation
```
Key Features Implemented
✅ PrismPattern Controller
- Reconciles Deployment, Service, HPA/KEDA
- Status tracking with phases and conditions
- Owner references for cascading deletes
✅ HPA Auto-Scaling
- CPU and memory-based scaling
- Custom metrics support (Prometheus, etc.)
- Scaling behavior policies
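For `scaler: hpa`, the controller emits a standard `autoscaling/v2` HorizontalPodAutoscaler along these lines (an illustrative sketch mirroring the PrismStack proxy settings above; the target name is hypothetical):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: prism-proxy
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: prism-proxy
  minReplicas: 5
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75
```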
✅ KEDA Auto-Scaling
- 60+ supported scalers (Kafka, NATS, SQS, Redis, etc.)
- Multi-trigger support
- Polling interval and cooldown configuration
- Authentication via TriggerAuthentication
✅ Placement Control
- Node selectors
- Affinity/anti-affinity rules
- Tolerations
- Topology spread constraints
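These placement features map onto the standard pod scheduling primitives; a combined example might look like this (field names under `placement` are assumptions based on the PrismPattern example above, and all values are hypothetical):

```yaml
spec:
  placement:
    nodeSelector:
      workload-type: compute-intensive
    tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "prism"
        effect: "NoSchedule"
    topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: consumer-kafka-orders
```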
Quick Start (5 minutes)
```bash
cd prism-operator

# 1. Install CRDs
make install

# 2. Install dependencies (metrics-server + KEDA)
make local-install-deps

# 3. Run operator locally
make local-run

# 4. Deploy a pattern (in another terminal)
kubectl apply -f config/samples/prismpattern_hpa_example.yaml

# 5. Watch auto-scaling
kubectl get prismpattern -w
kubectl get hpa -w
```
See prism-operator/QUICK_START.md for complete instructions.
Testing
```bash
# Run unit tests
make test

# Run with coverage
make test-coverage

# Local development workflow
make local-install-deps   # Install metrics-server + KEDA
make local-run            # Run operator
make local-test-hpa       # Test HPA example
make local-test-keda      # Test KEDA example
make local-status         # Show all resources
make local-clean          # Clean up
```
References
- Kubernetes Operator Pattern
- Kubebuilder Documentation
- KEDA Documentation
- KEDA Scalers - 60+ supported event sources
- MEMO-036 - Comprehensive operator architecture
- prism-operator/README.md - Operator documentation
- prism-operator/QUICK_START.md - 5-minute quickstart
- prism-operator/KEDA_INSTALL_GUIDE.md - KEDA setup guide
- ADR-034: Product/Feature Sharding (future integration)
Implementation Status
| Component | Status | Notes |
|---|---|---|
| PrismPattern CRD | ✅ Implemented | Full reconciliation loop, HPA/KEDA support |
| PrismStack CRD | ⚠️ Partial | Types defined, controller not implemented |
| HPA Auto-Scaling | ✅ Implemented | CPU/memory + custom metrics |
| KEDA Auto-Scaling | ✅ Implemented | 60+ scalers, multi-trigger support |
| Placement Control | ✅ Implemented | NodeSelector, affinity, tolerations |
| Status Tracking | ✅ Implemented | Phase progression, conditions |
| RBAC | ✅ Implemented | Minimal required permissions |
| Documentation | ✅ Complete | README, quickstart, KEDA guide, tests |
Next Steps
- Implement PrismStack Controller - Full stack orchestration
- Add PrismBackend CRD - Backend connection configuration
- Add PrismNamespace CRD - Multi-tenant namespace provisioning
- Integrate with Prism Admin API - Dynamic namespace creation
- Production Deployment - Helm chart, CI/CD integration
- Observability - Operator metrics, tracing
Revision History
- 2025-10-08: Initial draft proposing Kubernetes Operator with CRDs
- 2025-10-19: Updated with actual implementation details (status: Implemented)