Kubernetes Operator with Custom Resource Definitions

Context

Managing Prism deployments at scale requires automation for:

  • Namespace Lifecycle: Creating, updating, deleting namespaces across multiple Prism instances
  • Shard Management: Deploying product/feature-based shards (ADR-034)
  • Plugin Installation: Distributing plugins across instances
  • Configuration Sync: Keeping namespace configs consistent across replicas
  • Resource Management: CPU/memory limits, autoscaling, health checks

Manual Management Pain Points

Without automation:

  • YAML Hell: Manually maintaining hundreds of namespace config files
  • Deployment Complexity: running kubectl apply across multiple files is error-prone
  • Inconsistency: Config drift between Prism instances
  • No GitOps: Can't declaratively manage Prism infrastructure as code
  • Slow Iteration: Namespace changes require manual updates to multiple instances

Kubernetes Operator Pattern

Operators extend Kubernetes with custom logic to manage applications:

  • CRDs (Custom Resource Definitions): Define custom resources (e.g., PrismNamespace)
  • Controller: Watches CRDs, reconciles desired state → actual state
  • Declarative: Describe what you want, operator figures out how

Examples: PostgreSQL Operator, Kafka Operator, Istio Operator

Decision

Build a Prism Kubernetes Operator that manages Prism deployments via Custom Resource Definitions (CRDs).

Status: ✅ IMPLEMENTED - The operator is fully implemented with auto-scaling support via HPA and KEDA. See prism-operator/ directory.

Implemented Custom Resources

The operator currently implements two primary CRDs:

1. PrismPattern (Implemented)

Individual pattern runner with auto-scaling support:

apiVersion: prism.io/v1alpha1
kind: PrismPattern
metadata:
  name: consumer-kafka-orders
spec:
  pattern: consumer
  backend: kafka
  image: ghcr.io/prism/consumer-runner:latest
  replicas: 2

  resources:
    requests:
      cpu: "500m"
      memory: "1Gi"
    limits:
      cpu: "2000m"
      memory: "4Gi"

  service:
    type: ClusterIP
    port: 8080

  # HPA or KEDA auto-scaling
  autoscaling:
    enabled: true
    scaler: keda  # or "hpa"
    minReplicas: 2
    maxReplicas: 50
    pollingInterval: 10
    cooldownPeriod: 300

    triggers:
      - type: kafka
        metadata:
          bootstrapServers: "kafka:9092"
          consumerGroup: "prism-orders"
          topic: "orders"
          lagThreshold: "1000"

  placement:
    nodeSelector:
      workload-type: compute-intensive

status:
  created: 2025-10-10
  updated: 2025-11-15
  phase: Running  # Pending → Progressing → Running
  replicas: 5
  availableReplicas: 5
  conditions:
    - type: Ready
      status: "True"
      created: 2025-10-10
      updated: 2025-11-15

2. PrismStack (Partially Implemented)

Complete Prism deployment including proxy, admin, patterns, and backends:

apiVersion: prism.io/v1alpha1
kind: PrismStack
metadata:
  name: production-stack
spec:
  proxy:
    image: ghcr.io/prism/prism-proxy:latest
    replicas: 5
    port: 8980
    autoscaling:
      enabled: true
      scaler: hpa
      minReplicas: 5
      maxReplicas: 20
      targetCPUUtilizationPercentage: 75

  admin:
    enabled: true
    port: 8981
    replicas: 3
    leaderElection:
      enabled: true

  patterns:
    - name: orders-consumer
      type: consumer
      backend: kafka
      replicas: 10

  backends:
    - name: kafka-prod
      type: kafka
      connectionString: "kafka.prod.svc:9092"

status:
  created: 2025-10-10
  updated: 2025-11-15
  phase: Running
  conditions:
    - type: Ready
      status: "True"
      created: 2025-10-10
      updated: 2025-11-15

Note: PrismStack CRD is defined but the controller is not yet fully implemented. Focus is on PrismPattern which is production-ready.

Future CRDs (Planned)

The following CRDs are planned for future implementation:

  • PrismShard: Product/feature-based sharding (from ADR-034)
  • PrismPlugin: Plugin lifecycle management
  • PrismNamespace: Multi-tenant namespace provisioning
  • PrismBackend: Backend connection configuration

Operator Architecture (Implemented)

┌──────────────────────────────────────────────────────────────┐
│ Kubernetes Cluster                                           │
│                                                              │
│  ┌────────────────────────────────────────────────────────┐  │
│  │ Prism Operator (Controller)                            │  │
│  │                                                        │  │
│  │ Watches:                                               │  │
│  │   - PrismPattern CRDs                                  │  │
│  │   - PrismStack CRDs (partial)                          │  │
│  │                                                        │  │
│  │ Reconciles:                                            │  │
│  │   1. Creates/updates Deployment                        │  │
│  │   2. Creates/updates Service                           │  │
│  │   3. Creates HPA (CPU/memory) or KEDA ScaledObject     │  │
│  │   4. Updates status with phase and conditions          │  │
│  └────────────────────────────────────────────────────────┘  │
│                              │                               │
│                              ▼                               │
│  ┌────────────────────────────────────────────────────────┐  │
│  │ PrismPattern: consumer-kafka-orders                    │  │
│  │                                                        │  │
│  │   Deployment               Service                     │  │
│  │   ┌─────────┐              ┌────────┐                  │  │
│  │   │ Pod 1   │              │ :8080  │                  │  │
│  │   │ Pod 2   │──────────────│        │                  │  │
│  │   │ ...     │              └────────┘                  │  │
│  │   │ Pod N   │                                          │  │
│  │   └─────────┘                                          │  │
│  │        │                                               │  │
│  │        └─── Scaled by: HPA or KEDA ScaledObject        │  │
│  │                                                        │  │
│  │   Auto-Scaling Triggers:                               │  │
│  │     - CPU/Memory (HPA)                                 │  │
│  │     - Kafka lag (KEDA)                                 │  │
│  │     - NATS queue depth (KEDA)                          │  │
│  │     - SQS queue length (KEDA)                          │  │
│  │     - 60+ other KEDA scalers                           │  │
│  └────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────┘

Reconciliation Logic (Implemented)

When a PrismPattern is created or updated:

  1. Reconcile Deployment:

    • Create or update Deployment with specified image, replicas, resources
    • Apply placement constraints (nodeSelector, affinity, tolerations)
    • Set owner reference for garbage collection
  2. Reconcile Service (if spec.service is defined):

    • Create or update Service with specified type and port
    • Add labels for pattern identification
  3. Reconcile Auto-Scaling (if spec.autoscaling.enabled == true):

    • HPA Mode (scaler: hpa):
      • Create HorizontalPodAutoscaler with CPU/memory targets
      • Apply scaling behavior policies if specified
    • KEDA Mode (scaler: keda):
      • Create KEDA ScaledObject with triggers (Kafka, NATS, SQS, etc.); see the sketch after this list
      • Configure polling interval and cooldown period
      • Support multiple triggers simultaneously
  4. Update Status:

    • Set phase: Pending → Progressing → Running
    • Update replica counts from Deployment
    • Add Ready condition with message
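
As a concrete illustration of step 3 in KEDA mode, here is roughly the ScaledObject the controller would generate for the consumer-kafka-orders example above. The object name is an assumption (mirroring the PrismPattern name); the field layout follows KEDA's ScaledObject API:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: consumer-kafka-orders        # assumed naming, mirrors the PrismPattern
spec:
  scaleTargetRef:
    name: consumer-kafka-orders      # the Deployment created in step 1
  minReplicaCount: 2                 # from spec.autoscaling.minReplicas
  maxReplicaCount: 50                # from spec.autoscaling.maxReplicas
  pollingInterval: 10
  cooldownPeriod: 300
  triggers:                          # copied through from spec.autoscaling.triggers
    - type: kafka
      metadata:
        bootstrapServers: "kafka:9092"
        consumerGroup: "prism-orders"
        topic: "orders"
        lagThreshold: "1000"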

Supported KEDA Scalers:

  • Kafka (consumer lag)
  • NATS JetStream (pending messages)
  • AWS SQS (queue depth)
  • RabbitMQ (queue length)
  • Redis (list/stream length)
  • PostgreSQL (custom queries)
  • 60+ more scalers
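
Multiple scalers can be combined on one pattern (the multi-trigger sample in config/samples/ exercises this). A hedged sketch of such a spec, scaling on both Kafka lag and SQS backlog; the metadata keys follow KEDA's kafka and aws-sqs-queue scalers, and the queue URL is hypothetical:

spec:
  autoscaling:
    enabled: true
    scaler: keda
    minReplicas: 2
    maxReplicas: 50
    triggers:
      - type: kafka                  # scale on consumer lag
        metadata:
          bootstrapServers: "kafka:9092"
          consumerGroup: "prism-orders"
          topic: "orders"
          lagThreshold: "1000"
      - type: aws-sqs-queue          # and on SQS queue depth
        metadata:
          queueURL: "https://sqs.us-east-1.amazonaws.com/123456789012/orders"  # hypothetical queue
          queueLength: "100"
          awsRegion: "us-east-1"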

See prism-operator/README.md for complete documentation.

Rationale

Why Custom Operator vs Raw Kubernetes?

Without Operator (raw Kubernetes manifests):

# Must manually define:
- Deployment for each shard
- StatefulSet for SQLite persistence
- Services for each shard
- ConfigMaps for namespace configs (must sync manually!)
- Plugin sidecar injection (manual, error-prone)

With Operator:

# Just define:
apiVersion: prism.io/v1alpha1
kind: PrismNamespace
metadata:
  name: my-namespace
spec:
  backend: postgres
  pattern: keyvalue
# Operator handles the rest!

Compared to Alternatives

vs Helm Charts:

  • ✅ Operator is dynamic (watches for changes, reconciles)
  • ✅ Operator can query Prism API for current state
  • ❌ Helm is static (install/upgrade only)
  • Use both: install the operator via Helm, then let it manage the CRDs

vs Manual kubectl:

  • ✅ Operator enforces best practices
  • ✅ Operator handles complex workflows (rolling updates, health checks)
  • ❌ kubectl requires manual orchestration
  • Operator wins for production deployments

vs External Tool (Ansible, Terraform):

  • ✅ Operator is Kubernetes-native (no external dependencies)
  • ✅ Operator continuously reconciles (self-healing)
  • ❌ External tools are one-shot (no continuous reconciliation)
  • Operator preferred for Kubernetes environments

Alternatives Considered

1. Helm Charts Only

  • Pros: Simpler, no custom code
  • Cons: No dynamic reconciliation, can't query Prism state
  • Rejected because: Doesn't scale operationally (manual config sync)

2. GitOps (ArgoCD/Flux) Without Operator

  • Pros: Declarative, Git as source of truth
  • Cons: Still need to manage low-level Kubernetes resources manually
  • Partially accepted: Use GitOps + Operator (ArgoCD applies CRDs, operator reconciles)

3. Serverless Functions (AWS Lambda, CloudRun)

  • Pros: No Kubernetes needed
  • Cons: Stateful config management harder, no standard API
  • Rejected because: Prism is Kubernetes-native, operator pattern is standard

Consequences

Positive

  • Declarative Management: kubectl apply -f namespace.yaml creates the namespace across all shards
  • GitOps Ready: CRDs in Git → ArgoCD applies → Operator reconciles
  • Self-Healing: Operator detects drift and corrects it
  • Reduced Ops Burden: No manual config sync, deployment orchestration
  • Type Safety: CRDs are schema-validated by Kubernetes API server
  • Extensibility: Easy to add new CRDs (e.g., PrismMigration for shadow traffic automation)

Negative

  • Operator Complexity: Must maintain operator code (Go with kubebuilder/controller-runtime)
  • Kubernetes Dependency: Prism is now tightly coupled to Kubernetes (but can still run standalone)
  • Learning Curve: Operators require understanding of reconciliation loops, watches, caching

Neutral

  • CRD Versioning: Must handle API versioning (v1alpha1 → v1beta1 → v1) over time
  • RBAC: Operator needs permissions to create/update Deployments, Services, etc.
  • Observability: Operator needs its own metrics, logging, tracing

Implementation Details

Technology Stack (Implemented)

Language: Go with kubebuilder/controller-runtime

  • Chosen for mature Kubernetes ecosystem and extensive examples
  • Built using kubebuilder scaffolding
  • Comprehensive testing with envtest

Actual Project Structure

prism-operator/
├── Makefile                             # Build, test, deploy targets
├── go.mod                               # Go dependencies
├── cmd/
│   └── manager/
│       └── main.go                      # Operator entry point
├── api/
│   └── v1alpha1/
│       ├── prismpattern_types.go        # ✅ Implemented
│       ├── prismstack_types.go          # Partially implemented
│       └── groupversion_info.go
├── controllers/
│   └── prismpattern_controller.go       # ✅ Fully implemented
├── pkg/
│   └── autoscaling/
│       ├── hpa.go                       # HPA reconciliation
│       └── keda.go                      # KEDA ScaledObject reconciliation
├── config/
│   ├── crd/                             # Generated CRD manifests
│   │   └── bases/
│   │       └── prism.io_prismpatterns.yaml
│   ├── rbac/                            # RBAC manifests
│   ├── manager/                         # Operator deployment
│   └── samples/                         # Example patterns
│       ├── prismpattern_hpa_example.yaml
│       ├── prismpattern_keda_kafka_example.yaml
│       └── prismpattern_keda_multi_trigger_example.yaml
├── scripts/
│   └── install-keda.sh                  # KEDA installation helper
├── README.md                            # Complete operator documentation
├── QUICK_START.md                       # 5-minute quickstart guide
├── KEDA_INSTALL_GUIDE.md                # KEDA installation details
└── TEST_REPORT.md                       # Test results and validation

Key Features Implemented

PrismPattern Controller

  • Reconciles Deployment, Service, HPA/KEDA
  • Status tracking with phases and conditions
  • Owner references for cascading deletes
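
Cascading deletes work because the controller stamps each child resource with an owner reference. A minimal sketch of what this looks like on the generated Deployment's metadata; the uid is filled in from the live PrismPattern at reconcile time:

metadata:
  ownerReferences:
    - apiVersion: prism.io/v1alpha1
      kind: PrismPattern
      name: consumer-kafka-orders
      uid: 3f8c...                   # illustrative; set by the controller from the parent's UID
      controller: true
      blockOwnerDeletion: true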

HPA Auto-Scaling

  • CPU and memory-based scaling
  • Custom metrics support (Prometheus, etc.)
  • Scaling behavior policies
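
In HPA mode, the operator emits a standard autoscaling/v2 object. A minimal sketch, assuming the consumer-kafka-orders pattern with a 75% CPU target; the scale-down policy shown is illustrative and appears only when specified in the PrismPattern:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: consumer-kafka-orders        # assumed naming
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: consumer-kafka-orders
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # damp flapping on scale-down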

KEDA Auto-Scaling

  • 60+ supported scalers (Kafka, NATS, SQS, Redis, etc.)
  • Multi-trigger support
  • Polling interval and cooldown configuration
  • Authentication via TriggerAuthentication
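
Scaler credentials are wired in through KEDA TriggerAuthentication objects rather than inline secrets. A minimal sketch for Kafka SASL, assuming a pre-existing Secret named kafka-credentials; the object name is hypothetical:

apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: kafka-auth                   # hypothetical name
spec:
  secretTargetRef:
    - parameter: username            # maps Secret keys to scaler parameters
      name: kafka-credentials        # assumed pre-existing Secret
      key: username
    - parameter: password
      name: kafka-credentials
      key: password

A trigger then opts in by referencing it via authenticationRef: name: kafka-auth.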

Placement Control

  • Node selectors
  • Affinity/anti-affinity rules
  • Tolerations
  • Topology spread constraints
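
A hedged sketch of these placement fields on a PrismPattern; the assumption, consistent with the example earlier in this document, is that spec.placement mirrors the corresponding Kubernetes pod scheduling fields:

spec:
  placement:
    nodeSelector:
      workload-type: compute-intensive
    tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "prism"
        effect: "NoSchedule"
    affinity:
      podAntiAffinity:               # prefer spreading replicas across nodes
        preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: consumer-kafka-orders
              topologyKey: kubernetes.io/hostname
    topologySpreadConstraints:       # and across zones
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: consumer-kafka-orders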

Quick Start (5 minutes)

cd prism-operator

# 1. Install CRDs
make install

# 2. Install dependencies (metrics-server + KEDA)
make local-install-deps

# 3. Run operator locally
make local-run

# 4. Deploy a pattern (in another terminal)
kubectl apply -f config/samples/prismpattern_hpa_example.yaml

# 5. Watch auto-scaling
kubectl get prismpattern -w
kubectl get hpa -w

See prism-operator/QUICK_START.md for complete instructions.

Testing

# Run unit tests
make test

# Run with coverage
make test-coverage

# Local development workflow
make local-install-deps # Install metrics-server + KEDA
make local-run # Run operator
make local-test-hpa # Test HPA example
make local-test-keda # Test KEDA example
make local-status # Show all resources
make local-clean # Clean up

Implementation Status

| Component         | Status         | Notes                                      |
|-------------------|----------------|--------------------------------------------|
| PrismPattern CRD  | ✅ Implemented | Full reconciliation loop, HPA/KEDA support |
| PrismStack CRD    | ⚠️ Partial     | Types defined, controller not implemented  |
| HPA Auto-Scaling  | ✅ Implemented | CPU/memory + custom metrics                |
| KEDA Auto-Scaling | ✅ Implemented | 60+ scalers, multi-trigger support         |
| Placement Control | ✅ Implemented | NodeSelector, affinity, tolerations        |
| Status Tracking   | ✅ Implemented | Phase progression, conditions              |
| RBAC              | ✅ Implemented | Minimal required permissions               |
| Documentation     | ✅ Complete    | README, quickstart, KEDA guide, tests      |

Next Steps

  1. Implement PrismStack Controller - Full stack orchestration
  2. Add PrismBackend CRD - Backend connection configuration
  3. Add PrismNamespace CRD - Multi-tenant namespace provisioning
  4. Integrate with Prism Admin API - Dynamic namespace creation
  5. Production Deployment - Helm chart, CI/CD integration
  6. Observability - Operator metrics, tracing

Revision History

  • 2025-10-08: Initial draft proposing Kubernetes Operator with CRDs
  • 2025-10-19: Updated with actual implementation details (status: Implemented)