Operations
Read time: 10 minutes
Daily Operations
Check Namespace Health
# View overall status
prism namespace status user-events
# Output:
# Namespace: user-events
# Status: ✓ healthy
# Uptime: 45 days
# Last restart: 2025-09-04 10:30:00
Monitor Performance
# Real-time metrics
prism namespace metrics user-events --watch
# Output (updates every 5 seconds):
# Latency: P50=0.8ms P99=3.2ms
# Throughput: 850 writes/sec, 4200 reads/sec
# Errors: 0.01% (2 errors in last minute)
# Backend: healthy
View Recent Errors
# Show errors from last hour
prism namespace errors user-events --last 1h
# Output:
# 2025-10-19 10:30:00 TimeoutError: Backend timeout after 5s
# 2025-10-19 10:31:15 ValidationError: Message exceeds size limit
Capacity Management
Scale Up Capacity
When to scale:
- Write/read RPS approaching 80% of capacity
- Latency P99 increasing
- Backend resource utilization high
How to scale:
# Check current capacity utilization
prism namespace status user-events
# Output:
# Write RPS: 4200 / 5000 (84% utilization) ← Near limit!
# Read RPS: 8500 / 10000 (85% utilization)
# Increase capacity
prism namespace update user-events \
--write-rps 10000 \
--read-rps 20000
# Prism automatically:
# - Increases backend partitions/shards
# - Adjusts connection pools
# - Updates resource allocation
# - Migrates with zero downtime
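The 80% rule of thumb above can be checked mechanically. The following is a minimal sketch, not part of the Prism CLI: `utilization_pct` and `needs_scale_up` are hypothetical helper names, and the RPS figures are typed in by hand from the example status output rather than parsed from it.

```shell
# Hypothetical helpers (not part of the prism CLI).
# utilization_pct USED CAPACITY -> integer percent, rounded down.
utilization_pct() {
  echo $(( $1 * 100 / $2 ))
}

# needs_scale_up USED CAPACITY [THRESHOLD_PCT]
# Succeeds (exit 0) when utilization is at or above the threshold (default 80).
needs_scale_up() {
  [ "$(utilization_pct "$1" "$2")" -ge "${3:-80}" ]
}

# Values copied from the example status output above.
if needs_scale_up 4200 5000; then
  echo "write path at $(utilization_pct 4200 5000)%: consider scaling up"
fi
```

Passing a third argument changes the threshold, for example `needs_scale_up 4200 5000 70` for a more conservative policy.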
Scale Down Capacity
When to scale down:
- Consistently using <50% of capacity
- Need to reduce costs
- Seasonal traffic decrease
# Scale down gradually
prism namespace update user-events \
--write-rps 2000 \
--read-rps 5000
# Monitor for 24 hours before further reduction
Best practice: Scale down in steps (don't go from 10K to 1K immediately).
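One way to plan a gradual reduction is to halve the limit at each step until the target is reached. The halving factor is an assumption chosen for illustration, not a Prism recommendation, and `step_down_plan` is a hypothetical helper.

```shell
# Hypothetical helper: print a step-down schedule from CURRENT to TARGET,
# halving at each step (assumed policy) and never going below TARGET.
# The body runs in a subshell so its variables stay local.
step_down_plan() (
  current=$1
  target=$2
  while [ "$current" -gt "$target" ]; do
    next=$(( current / 2 ))
    [ "$next" -lt "$target" ] && next=$target
    echo "$next"
    current=$next
  done
)

# Plan a reduction from 10000 to 1000 write RPS, one step per line.
step_down_plan 10000 1000
```

Each printed value would be applied with `prism namespace update`, followed by the 24-hour observation period before the next step.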
Troubleshooting
High Latency
Symptom: Operations taking longer than expected
Diagnosis:
# Check latency percentiles
prism namespace metrics user-events
# Look for:
# P99 > 50ms: Backend overloaded or network issues
# P50 and P99 both within target: Normal operation
# Sudden spike: Check for errors
Common causes:
- Backend overload:
# Check capacity utilization
prism namespace status user-events
# If >80%, scale up
- Network issues:
# Check backend connectivity
prism backend health kafka-cluster-1
- Large messages:
# Check message sizes
prism namespace stats user-events --message-sizes
# Consider enabling Claim Check pattern
Data Not Appearing
Symptom: Published data not visible to consumers
Diagnosis:
# Check recent writes
prism namespace stats user-events --last 5m
# Output:
# Writes: 1234 (all acknowledged)
# Reads: 0 ← No consumers!
# Check consumer status
prism namespace consumers user-events
# Output:
# Consumer: analytics-api
# Status: disconnected ← Problem found!
# Last seen: 2025-10-19 09:00:00 (90 minutes ago)
Common causes:
- Consumer not running:
# Verify consumer service is running
kubectl get pods -l app=analytics-api
- Consumer lag:
# Check consumer lag
prism namespace consumers user-events --lag
# Output:
# Consumer: analytics-api
# Lag: 50000 messages ← Falling behind!
# Processing rate: 100/sec
# Time to catch up: ~8 minutes
- Access permissions:
# Verify consumer has access
prism namespace access user-events
# If missing, grant access:
prism namespace grant user-events \
--service analytics-api \
--permissions read
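The catch-up estimate in the consumer lag output is the backlog divided by the net drain rate. Below is a rough sketch of that arithmetic. It assumes the producer rate is known (the example figures only work out to about 8 minutes if no new messages arrive), and `catchup_seconds` is a hypothetical helper, not a prism command.

```shell
# Hypothetical helper: estimate seconds until a consumer clears its backlog.
# catchup_seconds LAG CONSUME_RATE [PRODUCE_RATE]   (rates in messages/sec)
catchup_seconds() (
  lag=$1
  drain=$(( $2 - ${3:-0} ))   # net rate at which the backlog shrinks
  if [ "$drain" -le 0 ]; then
    echo "never"              # consumer is not keeping up with producers
  else
    echo $(( (lag + drain - 1) / drain ))   # round up to whole seconds
  fi
)

catchup_seconds 50000 100      # 500 seconds, roughly 8 minutes
catchup_seconds 50000 100 100  # never: the backlog does not shrink
```

If the result is "never" or far longer than acceptable, the consumer likely needs more instances or faster processing rather than more waiting.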
Backend Errors
Symptom: Errors in logs or metrics
Diagnosis:
# View error details
prism namespace errors user-events --last 1h --verbose
# Output:
# 2025-10-19 10:30:00 ERROR
# Type: BackendTimeoutError
# Message: Kafka broker timeout after 5s
# Backend: kafka-cluster-1
# Retry: 3/3 (failed)
Common causes:
- Backend unavailable:
# Check backend health
prism backend health kafka-cluster-1
# Output:
# Status: degraded
# Broker 1: healthy
# Broker 2: unhealthy ← Problem!
# Broker 3: healthy
- Network partition:
# Check connectivity
prism backend connectivity kafka-cluster-1
# Test from Prism proxy to backend
- Backend overload:
# Check backend metrics
prism backend metrics kafka-cluster-1
# Look for high CPU, memory, disk I/O
Data Management
Adjust Retention
# Increase retention (keeps more data)
prism namespace update user-events --retention 30days
# Decrease retention (saves storage)
prism namespace update user-events --retention 7days
# Prism automatically:
# - Updates backend retention policies
# - Schedules data cleanup
# - Adjusts storage allocation
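Before changing retention, it can help to estimate how much storage the new window implies. The sketch below multiplies write rate, average message size, and retention length. The 1024-byte message size is an assumption for illustration, `retention_gib` is a hypothetical helper, and the estimate ignores compression and backend replication.

```shell
# Hypothetical helper: rough storage needed for a retention window.
# retention_gib WRITES_PER_SEC AVG_MSG_BYTES RETENTION_DAYS -> whole GiB, rounded down
retention_gib() {
  echo $(( $1 * $2 * 86400 * $3 / 1073741824 ))
}

# 850 writes/sec (from the metrics example) at an assumed 1024 bytes per message.
retention_gib 850 1024 30   # 2101 GiB for 30-day retention
retention_gib 850 1024 7    # 490 GiB for 7-day retention
```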
Manual Data Cleanup
# Delete data older than specific date
prism namespace cleanup user-events \
--before 2025-09-01
# Delete all data (careful!)
prism namespace purge user-events --confirm
# Prism will ask for confirmation:
# WARNING: This will delete ALL data in namespace user-events
# Type namespace name to confirm: user-events
Export Data
# Export to file
prism namespace export user-events \
--start 2025-10-01 \
--end 2025-10-19 \
--format jsonl \
--output user-events-export.jsonl
# Export to S3
prism namespace export user-events \
--start 2025-10-01 \
--end 2025-10-19 \
--format parquet \
--output s3://backups/user-events/
Access Management
Grant Access
# Grant read-write access
prism namespace grant user-events \
--service new-consumer \
--permissions read,write
# Grant read-only access
prism namespace grant user-events \
--service reporting-dashboard \
--permissions read
# Grant admin access
prism namespace grant user-events \
--team platform-team \
--permissions admin
Revoke Access
# Revoke all access
prism namespace revoke user-events \
--service old-consumer
# Revoke specific permissions
prism namespace revoke user-events \
--service analytics-api \
--permissions write
View Access
# List all access grants
prism namespace access user-events
# Output:
# Owners:
# - analytics-team (admin)
#
# Consumers:
# - analytics-api (read, write)
# - reporting-dashboard (read)
# - data-pipeline (write)
Alerting
Configure Alerts
# alerts.yaml
namespace: user-events
alerts:
- name: high-latency
condition: p99_latency > 50ms
duration: 5m
severity: warning
- name: high-error-rate
condition: error_rate > 1%
duration: 1m
severity: critical
- name: capacity-limit
condition: write_rps > 0.9 * capacity
duration: 10m
severity: warning
- name: consumer-lag
condition: consumer_lag > 10000
duration: 5m
severity: warning
# Apply alerts
prism namespace alerts create user-events alerts.yaml
# View active alerts
prism namespace alerts list user-events
# Silence alert temporarily
prism namespace alerts silence user-events \
--alert high-latency \
--duration 1h
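Each rule in alerts.yaml pairs a `condition` with a `duration`. How Prism evaluates `duration` is not specified here; one plausible reading is that the condition must hold for every sample in the window before the alert fires. The sketch below illustrates that reading only. `sustained_above` is a hypothetical helper, and the one-sample-per-minute cadence is an assumption.

```shell
# Hypothetical sketch of "condition must hold for the whole duration".
# Reads one integer sample per line from stdin (assumed: one sample per minute)
# and succeeds if the most recent N samples are all above THRESHOLD.
# sustained_above THRESHOLD N
sustained_above() (
  threshold=$1
  needed=$2
  streak=0
  while read -r sample; do
    if [ "$sample" -gt "$threshold" ]; then
      streak=$(( streak + 1 ))
    else
      streak=0                 # any dip below the threshold resets the window
    fi
  done
  [ "$streak" -ge "$needed" ]
)

# consumer-lag rule: consumer_lag > 10000 for 5m
if printf '%s\n' 9000 12000 13000 12500 14000 15000 | sustained_above 10000 5; then
  echo "consumer-lag: firing"
fi
```

Under this reading a short spike does not fire the alert, which is why the critical high-error-rate rule uses a shorter duration than the warnings.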
Backup and Recovery
Create Backup
# Full namespace backup
prism namespace backup user-events \
--output s3://backups/user-events/2025-10-19/
# Incremental backup (only new data)
prism namespace backup user-events \
--incremental \
--since 2025-10-18 \
--output s3://backups/user-events/incremental/
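For scheduled backups, the dated destination can be generated instead of typed. This is a small sketch that mirrors the `s3://backups/<namespace>/<date>/` layout used in the example above; `backup_path` is a hypothetical helper, and the use of UTC for the date is an assumption.

```shell
# Hypothetical helper: build a dated backup destination like the one above.
# backup_path NAMESPACE [DATE]   (DATE defaults to today in UTC, YYYY-MM-DD)
backup_path() {
  echo "s3://backups/$1/${2:-$(date -u +%Y-%m-%d)}/"
}

backup_path user-events 2025-10-19

# A scheduled job could then run (not executed here):
#   prism namespace backup user-events --output "$(backup_path user-events)"
```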
Restore from Backup
# Restore entire namespace
prism namespace restore user-events \
--source s3://backups/user-events/2025-10-19/ \
--confirm
# Restore specific time range
prism namespace restore user-events \
--source s3://backups/user-events/2025-10-19/ \
--start 2025-10-19T10:00:00Z \
--end 2025-10-19T12:00:00Z
Point-in-Time Recovery
# Restore to specific timestamp
prism namespace restore user-events \
--timestamp 2025-10-19T11:30:00Z
# Prism restores data to exactly that point
Migration Operations
Backend Migration
Scenario: Migrate from NATS to Kafka (zero downtime)
# Step 1: Enable shadow traffic
prism namespace migrate user-events \
--target-backend kafka \
--mode shadow
# Prism now:
# - Writes to NATS (primary)
# - Writes to Kafka (shadow, not used for reads)
# - Compares results
# Step 2: Monitor shadow traffic (24-48 hours)
prism namespace migrate-status user-events
# Output:
# Migration: NATS → Kafka
# Mode: shadow
# Duration: 36 hours
# Write success rate: 99.99% (both backends)
# Comparison: 100% match
# Step 3: Switch reads to Kafka
prism namespace migrate user-events \
--target-backend kafka \
--mode read-from-new
# Prism now:
# - Writes to both NATS and Kafka
# - Reads from Kafka only
# Step 4: Monitor (24 hours)
# Step 5: Complete migration
prism namespace migrate user-events \
--target-backend kafka \
--mode complete
# Prism now:
# - Writes to Kafka only
# - Reads from Kafka only
# - NATS deprovisioned
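The steps above form a fixed sequence of modes with a monitoring period between each. The sketch below encodes that sequence and a simple gate for moving forward. Both function names are hypothetical, and the gate (at least 24 hours in the current mode and a 100% comparison match) is an assumption based on the monitoring windows and example status output above, not a rule enforced by Prism.

```shell
# Hypothetical helpers describing the migration sequence shown above.
# next_mode MODE -> the mode that follows MODE, or nothing after "complete".
next_mode() {
  case "$1" in
    shadow)        echo "read-from-new" ;;
    read-from-new) echo "complete" ;;
    complete)      ;;
    *)             echo "unknown mode: $1" >&2; return 1 ;;
  esac
}

# ready_to_advance HOURS_IN_MODE MATCH_PCT
# Assumed gate: at least 24 hours in the current mode and a 100% comparison match.
ready_to_advance() {
  [ "$1" -ge 24 ] && [ "$2" -eq 100 ]
}

# Figures from the example migrate-status output: 36 hours, 100% match.
if ready_to_advance 36 100; then
  echo "next: $(next_mode shadow)"
fi
```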
Rollback Migration
# If issues detected, rollback immediately
prism namespace migrate user-events --rollback
# Prism reverts to previous backend (NATS)
Performance Optimization
Enable Caching
# Enable read cache
prism namespace update user-events \
--cache-enabled \
--cache-ttl 60s
# Check cache hit rate
prism namespace metrics user-events --cache
# Output:
# Cache hit rate: 85%
# Cache size: 1.2GB / 2GB
# Evictions: 1000/sec
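The hit rate translates directly into backend load: only cache misses reach the backend. A minimal sketch of that arithmetic, using the 4200 reads/sec figure from the metrics example earlier; `backend_reads` is a hypothetical helper.

```shell
# Hypothetical helper: reads per second that still reach the backend.
# backend_reads TOTAL_READS_PER_SEC HIT_RATE_PCT
backend_reads() {
  echo $(( $1 * (100 - $2) / 100 ))
}

backend_reads 4200 85   # 630 reads/sec reach the backend
```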
Tune Connection Pools
# Increase connection pool (high concurrency)
prism namespace update user-events \
--connection-pool-size 100
# Decrease connection pool (low concurrency)
prism namespace update user-events \
--connection-pool-size 20
Compression
# Enable compression (reduce bandwidth)
prism namespace update user-events \
--compression gzip
# Check compression ratio
prism namespace stats user-events --compression
# Output:
# Original size: 10GB
# Compressed size: 2GB
# Compression ratio: 5:1
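A ratio such as 5:1 is easier to compare against bandwidth or storage budgets when expressed as a percentage saved. A minimal sketch, with `compression_savings_pct` as a hypothetical helper:

```shell
# Hypothetical helper: percent of bytes saved by compression.
# compression_savings_pct ORIGINAL_SIZE COMPRESSED_SIZE  (same unit for both)
compression_savings_pct() {
  echo $(( ($1 - $2) * 100 / $1 ))
}

compression_savings_pct 10 2   # 80, i.e. a 5:1 ratio saves 80%
```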
Monitoring Dashboards
View Dashboard
# Open Grafana dashboard
prism namespace dashboard user-events
# Opens browser to:
# http://grafana.local/d/prism-user-events
Dashboard Panels
Default dashboard includes:
- Throughput: Writes and reads per second
- Latency: P50, P95, P99 percentiles
- Errors: Error rate and error types
- Backend health: Backend status and connectivity
- Consumer lag: Per-consumer lag metrics
- Resource usage: CPU, memory, storage
Best Practices
Regular Checks
Daily:
- Check namespace status
- Review error logs
- Monitor capacity utilization
Weekly:
- Review capacity trends
- Check consumer lag
- Validate alert configuration
Monthly:
- Review retention policies
- Audit access permissions
- Test backup/restore procedures
Capacity Planning
- Monitor trends: Use a 30-day rolling average for capacity planning
- Scale proactively: Scale at 70% utilization (don't wait for 100%)
- Test scaling: Validate scaling in staging before production
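The rolling average mentioned above can be computed from exported daily figures. The sketch below averages the last N lines of a numeric series. `rolling_avg` is a hypothetical helper, the sample values are made up, and a 5-sample window is used only to keep the example short.

```shell
# Hypothetical helper: mean of the last N numeric lines on stdin, rounded down.
# rolling_avg N
rolling_avg() {
  tail -n "$1" | awk '{ sum += $1 } END { if (NR > 0) print int(sum / NR) }'
}

# Seven daily peak write-RPS samples (made-up numbers), averaged over the last 5.
printf '%s\n' 3000 3200 3100 3400 3600 3500 3700 | rolling_avg 5
```

With a capacity of 5000 write RPS, the 70% trigger is 3500, so the average of 3460 printed here sits just below the point where scaling would be planned.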
Incident Response
When alerts fire:
- Acknowledge alert to prevent duplicate notifications
- Check namespace status for quick diagnosis
- Review recent changes (deployments, config updates)
- Check backend health (external dependencies)
- Scale resources if capacity-related
- Engage platform team if backend issues
Getting Help
Support Channels
# Get help with command
prism namespace --help
# Check system status
prism status
# Contact platform team
# (internal support channel)
Useful Commands
# Summary of all namespaces
prism namespace list --status
# Detailed namespace info
prism namespace show user-events
# Recent activity
prism namespace activity user-events --last 24h
# Export metrics for analysis
prism namespace metrics user-events \
--start 2025-10-01 \
--end 2025-10-19 \
--format csv \
--output metrics.csv
Key Takeaways
- Monitor proactively: Check status daily, review trends weekly
- Scale before limits: Act at 70-80% capacity, not 100%
- Test operations: Validate backup/restore, scaling in staging
- Document incidents: Track issues and resolutions
- Automate alerts: Set up alerting for critical metrics
Next Steps
- Getting Started - Review basics
- Core Concepts - Understand architecture
- Architecture Guide - Deep technical dive