Operations

Read time: 10 minutes

Daily Operations

Check Namespace Health

# View overall status
prism namespace status user-events

# Output:
# Namespace: user-events
# Status: ✓ healthy
# Uptime: 45 days
# Last restart: 2025-09-04 10:30:00

Monitor Performance

# Real-time metrics
prism namespace metrics user-events --watch

# Output (updates every 5 seconds):
# Latency: P50=0.8ms P99=3.2ms
# Throughput: 850 writes/sec, 4200 reads/sec
# Errors: 0.01% (30 errors in last minute)
# Backend: healthy

View Recent Errors

# Show errors from last hour
prism namespace errors user-events --last 1h

# Output:
# 2025-10-19 10:30:00 TimeoutError: Backend timeout after 5s
# 2025-10-19 10:31:15 ValidationError: Message exceeds size limit

Capacity Management

Scale Up Capacity

When to scale:

Write/read RPS approaching 80% of capacity
Latency P99 increasing
Backend resource utilization high

How to scale:

# Check current capacity utilization
prism namespace status user-events

# Output:
# Write RPS: 4200 / 5000 (84% utilization)  ← Near limit!
# Read RPS: 8500 / 10000 (85% utilization)

# Increase capacity
prism namespace update user-events \
  --write-rps 10000 \
  --read-rps 20000

# Prism automatically:
# - Increases backend partitions/shards
# - Adjusts connection pools
# - Updates resource allocation
# - Migrates with zero downtime
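
The utilization figures in the status output are plain arithmetic, so the scale-up decision can be scripted. The sketch below is illustrative only: `utilization_pct` and `should_scale_up` are hypothetical helper names, not part of the Prism CLI, and the 80% default comes from the "When to scale" guidance above.

```python
def utilization_pct(current_rps: float, capacity_rps: float) -> float:
    """Return current load as a percentage of provisioned capacity."""
    if capacity_rps <= 0:
        raise ValueError("capacity_rps must be positive")
    return 100.0 * current_rps / capacity_rps


def should_scale_up(current_rps: float, capacity_rps: float,
                    threshold_pct: float = 80.0) -> bool:
    """True once utilization reaches the threshold (assumed 80% by default)."""
    return utilization_pct(current_rps, capacity_rps) >= threshold_pct


# Values copied from the sample status output above.
print(utilization_pct(4200, 5000))    # 84.0
print(should_scale_up(8500, 10000))   # True
```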

Scale Down Capacity

When to scale down:

Consistently using <50% of capacity
Need to reduce costs
Seasonal traffic decrease

# Scale down gradually
prism namespace update user-events \
  --write-rps 2000 \
  --read-rps 5000

# Monitor for 24 hours before further reduction

Best practice: Scale down in steps rather than cutting capacity from 10K to 1K RPS in a single change.
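
One way to plan a stepped reduction is to cap how much capacity is removed per change. This is a minimal sketch under stated assumptions: `scale_down_steps` is a hypothetical helper, and the 50% per-step cap is an illustrative choice, not a Prism rule. Each value would be applied with `prism namespace update`, followed by the monitoring period described above.

```python
from typing import List


def scale_down_steps(current_rps: int, target_rps: int,
                     max_reduction: float = 0.5) -> List[int]:
    """Return the capacity value for each step, never removing more than
    max_reduction (a fraction, assumed 0.5 here) of capacity at once."""
    if not 0 < max_reduction < 1:
        raise ValueError("max_reduction must be between 0 and 1")
    if target_rps <= 0 or target_rps > current_rps:
        raise ValueError("target_rps must be positive and not above current_rps")
    steps = []
    capacity = current_rps
    while capacity > target_rps:
        # Shrink by the allowed fraction, but never below the target.
        capacity = max(target_rps, int(capacity * (1 - max_reduction)))
        steps.append(capacity)
    return steps


print(scale_down_steps(10_000, 1_000))  # [5000, 2500, 1250, 1000]
```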

Troubleshooting

High Latency

Symptom: Operations taking longer than expected

Diagnosis:

# Check latency percentiles
prism namespace metrics user-events

# Look for:
# P99 > 50ms: Backend overloaded or network issues
# P50 and P99 both within target: Normal operation
# Sudden spike: Check for errors
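
When reading these numbers, it helps to recall how percentiles relate to raw samples. The following is a minimal nearest-rank sketch, assuming latency samples in milliseconds; how Prism computes its percentiles is not specified here, and `percentile` and the sample values are hypothetical. For the same window, P99 can never be lower than P50.

```python
import math
from typing import Sequence


def percentile(samples_ms: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples_ms:
        raise ValueError("samples_ms must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range 1..100")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical samples: mostly fast requests plus one slow outlier.
latencies_ms = [0.7, 0.8, 0.8, 0.9, 1.1, 1.3, 2.0, 3.2, 4.5, 60.0]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(f"P50={p50}ms P99={p99}ms")  # P50=1.1ms P99=60.0ms
```

A single slow outlier moves P99 past the 50ms mark while P50 stays low, which is why the tail percentile is the one to watch.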

Common causes:

Backend overload:

# Check capacity utilization
prism namespace status user-events
# If >80%, scale up

Network issues:

# Check backend connectivity
prism backend health kafka-cluster-1

Large messages:

# Check message sizes
prism namespace stats user-events --message-sizes
# Consider enabling Claim Check pattern

Data Not Appearing

Symptom: Published data not visible to consumers

Diagnosis:

# Check recent writes
prism namespace stats user-events --last 5m

# Output:
# Writes: 1234 (all acknowledged)
# Reads: 0  ← No consumers!

# Check consumer status
prism namespace consumers user-events

# Output:
# Consumer: analytics-api
#   Status: disconnected  ← Problem found!
#   Last seen: 2025-10-19 09:00:00 (90 minutes ago)

Common causes:

Consumer not running:

# Verify consumer service is running
kubectl get pods -l app=analytics-api

Consumer lag:

# Check consumer lag
prism namespace consumers user-events --lag

# Output:
# Consumer: analytics-api
#   Lag: 50000 messages  ← Falling behind!
#   Processing rate: 100/sec
#   Time to catch up: ~8 minutes
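
The catch-up estimate in the output above is lag divided by processing rate (50000 / 100 = 500 seconds, roughly 8 minutes). That only holds if no new messages arrive in the meantime. The sketch below adds an incoming rate to show the effect; `catch_up_seconds` is a hypothetical helper, not a Prism command.

```python
import math


def catch_up_seconds(lag_messages: float, processing_rate: float,
                     incoming_rate: float = 0.0) -> float:
    """Seconds until lag reaches zero, given per-second rates.

    Returns infinity when the consumer is not outpacing producers.
    """
    net_rate = processing_rate - incoming_rate
    if net_rate <= 0:
        return math.inf
    return lag_messages / net_rate


# Figures from the sample output above, with no new writes arriving.
print(catch_up_seconds(50_000, 100) / 60)      # about 8.3 minutes
# Same lag with a hypothetical 50 messages/sec still being published.
print(catch_up_seconds(50_000, 100, 50) / 60)  # about 16.7 minutes
```

If the incoming rate meets or exceeds the processing rate, the consumer never catches up and needs more throughput or more instances.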

Access permissions:

# Verify consumer has access
prism namespace access user-events

# If missing, grant access:
prism namespace grant user-events \
  --service analytics-api \
  --permissions read

Backend Errors

Symptom: Errors in logs or metrics

Diagnosis:

# View error details
prism namespace errors user-events --last 1h --verbose

# Output:
# 2025-10-19 10:30:00 ERROR
#   Type: BackendTimeoutError
#   Message: Kafka broker timeout after 5s
#   Backend: kafka-cluster-1
#   Retry: 3/3 (failed)

Common causes:

Backend unavailable:

# Check backend health
prism backend health kafka-cluster-1

# Output:
# Status: degraded
# Broker 1: healthy
# Broker 2: unhealthy  ← Problem!
# Broker 3: healthy

Network partition:

# Check connectivity
prism backend connectivity kafka-cluster-1

# Tests the network path from the Prism proxy to the backend

Backend overload:

# Check backend metrics
prism backend metrics kafka-cluster-1

# Look for high CPU, memory, disk I/O

Data Management

Adjust Retention

# Increase retention (keeps more data)
prism namespace update user-events --retention 30days

# Decrease retention (saves storage)
prism namespace update user-events --retention 7days

# Prism automatically:
# - Updates backend retention policies
# - Schedules data cleanup
# - Adjusts storage allocation
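
To get a feel for what a retention change means for storage, a rough estimate is write rate times average message size times retention period. This is a back-of-the-envelope sketch: `retained_bytes` is a hypothetical helper, the 1 KiB average message size is a made-up figure, and replication and compression are ignored.

```python
SECONDS_PER_DAY = 86_400


def retained_bytes(writes_per_sec: float, avg_message_bytes: float,
                   retention_days: float) -> float:
    """Approximate bytes held at steady state for a retention window."""
    return writes_per_sec * avg_message_bytes * SECONDS_PER_DAY * retention_days


# 850 writes/sec matches the sample metrics; 1024 bytes is an assumption.
week = retained_bytes(850, 1024, 7)
month = retained_bytes(850, 1024, 30)
print(f"{week / 1e12:.2f} TB for 7 days, {month / 1e12:.2f} TB for 30 days")
```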

Manual Data Cleanup

# Delete data older than specific date
prism namespace cleanup user-events \
  --before 2025-09-01

# Delete all data (careful!)
prism namespace purge user-events --confirm

# Prism will ask for confirmation:
# WARNING: This will delete ALL data in namespace user-events
# Type namespace name to confirm: user-events

Export Data

# Export to file
prism namespace export user-events \
  --start 2025-10-01 \
  --end 2025-10-19 \
  --format jsonl \
  --output user-events-export.jsonl

# Export to S3
prism namespace export user-events \
  --start 2025-10-01 \
  --end 2025-10-19 \
  --format parquet \
  --output s3://backups/user-events/

Access Management

Grant Access

# Grant read-write access
prism namespace grant user-events \
  --service new-consumer \
  --permissions read,write

# Grant read-only access
prism namespace grant user-events \
  --service reporting-dashboard \
  --permissions read

# Grant admin access
prism namespace grant user-events \
  --team platform-team \
  --permissions admin

Revoke Access

# Revoke all access
prism namespace revoke user-events \
  --service old-consumer

# Revoke specific permissions
prism namespace revoke user-events \
  --service analytics-api \
  --permissions write

View Access

# List all access grants
prism namespace access user-events

# Output:
# Owners:
#   - analytics-team (admin)
#
# Consumers:
#   - analytics-api (read, write)
#   - reporting-dashboard (read)
#   - data-pipeline (write)

Alerting

Configure Alerts

# alerts.yaml
namespace: user-events
alerts:
  - name: high-latency
    condition: p99_latency > 50ms
    duration: 5m
    severity: warning

  - name: high-error-rate
    condition: error_rate > 1%
    duration: 1m
    severity: critical

  - name: capacity-limit
    condition: write_rps > 0.9 * capacity
    duration: 10m
    severity: warning

  - name: consumer-lag
    condition: consumer_lag > 10000
    duration: 5m
    severity: warning

# Apply alerts
prism namespace alerts create user-events alerts.yaml

# View active alerts
prism namespace alerts list user-events

# Silence alert temporarily
prism namespace alerts silence user-events \
  --alert high-latency \
  --duration 1h
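
The `duration` field in the alert rules is most naturally read as "the condition must hold continuously for this long before the alert fires". That reading is an assumption, not confirmed Prism behavior. The sketch below illustrates it with a hypothetical `alert_fires` helper over timestamped samples.

```python
from typing import Iterable, Tuple


def alert_fires(samples: Iterable[Tuple[int, float]], threshold: float,
                duration_s: int) -> bool:
    """samples: (timestamp_seconds, value) pairs in time order.

    Returns True if value > threshold held without interruption for at
    least duration_s seconds (assumed semantics of `duration`).
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # breach window opens here
            if timestamp - breach_start >= duration_s:
                return True
        else:
            breach_start = None  # condition cleared, reset the window
    return False


# Hypothetical P99 latency samples (ms) taken once a minute.
sustained = [(t, 60.0) for t in range(0, 301, 60)]
with_dip = [(0, 60.0), (60, 60.0), (120, 20.0), (180, 60.0), (240, 60.0), (300, 60.0)]
print(alert_fires(sustained, threshold=50.0, duration_s=300))  # True
print(alert_fires(with_dip, threshold=50.0, duration_s=300))   # False
```

Under this reading, a short `duration` reacts faster but is noisier, which fits the 1m window on the critical error-rate rule and the longer windows on warnings.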

Backup and Recovery

Create Backup

# Full namespace backup
prism namespace backup user-events \
  --output s3://backups/user-events/2025-10-19/

# Incremental backup (only new data)
prism namespace backup user-events \
  --incremental \
  --since 2025-10-18 \
  --output s3://backups/user-events/incremental/

Restore from Backup

# Restore entire namespace
prism namespace restore user-events \
  --source s3://backups/user-events/2025-10-19/ \
  --confirm

# Restore specific time range
prism namespace restore user-events \
  --source s3://backups/user-events/2025-10-19/ \
  --start 2025-10-19T10:00:00Z \
  --end 2025-10-19T12:00:00Z

Point-in-Time Recovery

# Restore to specific timestamp
prism namespace restore user-events \
  --timestamp 2025-10-19T11:30:00Z

# Prism restores data to exactly that point

Migration Operations

Backend Migration

Scenario: Migrate from NATS to Kafka (zero downtime)

# Step 1: Enable shadow traffic
prism namespace migrate user-events \
  --target-backend kafka \
  --mode shadow

# Prism now:
# - Writes to NATS (primary)
# - Writes to Kafka (shadow, not used for reads)
# - Compares results

# Step 2: Monitor shadow traffic (24-48 hours)
prism namespace migrate-status user-events

# Output:
# Migration: NATS → Kafka
# Mode: shadow
# Duration: 36 hours
# Write success rate: 99.99% (both backends)
# Comparison: 100% match

# Step 3: Switch reads to Kafka
prism namespace migrate user-events \
  --target-backend kafka \
  --mode read-from-new

# Prism now:
# - Writes to both NATS and Kafka
# - Reads from Kafka only

# Step 4: Monitor (24 hours)

# Step 5: Complete migration
prism namespace migrate user-events \
  --target-backend kafka \
  --mode complete

# Prism now:
# - Writes to Kafka only
# - Reads from Kafka only
# - NATS deprovisioned
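
The three migration modes differ only in which backend receives writes and which serves reads. The table below restates the steps above as data; it is a sketch for reasoning about the phases, not Prism's implementation, and `Routing` and `routing_for` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Routing:
    writes: Tuple[str, ...]  # backends that receive every write
    reads: str               # backend that serves reads


def routing_for(mode: str, old: str = "nats", new: str = "kafka") -> Routing:
    """Map a migration mode to its write and read targets."""
    table = {
        "shadow": Routing(writes=(old, new), reads=old),
        "read-from-new": Routing(writes=(old, new), reads=new),
        "complete": Routing(writes=(new,), reads=new),
    }
    if mode not in table:
        raise ValueError(f"unknown migration mode: {mode}")
    return table[mode]


for mode in ("shadow", "read-from-new", "complete"):
    print(mode, routing_for(mode))
```

Because both backends keep receiving writes until the final step, the old backend stays current and remains a valid rollback target up to that point.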

Rollback Migration

# If issues detected, rollback immediately
prism namespace migrate user-events --rollback

# Prism reverts to previous backend (NATS)

Performance Optimization

Enable Caching

# Enable read cache
prism namespace update user-events \
  --cache-enabled \
  --cache-ttl 60s

# Check cache hit rate
prism namespace metrics user-events --cache

# Output:
# Cache hit rate: 85%
# Cache size: 1.2GB / 2GB
# Evictions: 1000/sec
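
The hit rate translates directly into backend load: only cache misses reach the backend. A small sketch, assuming the hit rate applies uniformly to all reads; `backend_reads_per_sec` is a hypothetical helper, and the 4200 reads/sec figure is borrowed from the sample metrics output earlier in this guide.

```python
def backend_reads_per_sec(total_reads_per_sec: float, hit_rate_pct: float) -> float:
    """Reads per second that miss the cache and reach the backend."""
    if not 0 <= hit_rate_pct <= 100:
        raise ValueError("hit_rate_pct must be between 0 and 100")
    return total_reads_per_sec * (100 - hit_rate_pct) / 100


print(backend_reads_per_sec(4200, 85))  # 630.0
```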

Tune Connection Pools

# Increase connection pool (high concurrency)
prism namespace update user-events \
  --connection-pool-size 100

# Decrease connection pool (low concurrency)
prism namespace update user-events \
  --connection-pool-size 20

Compression

# Enable compression (reduce bandwidth)
prism namespace update user-events \
  --compression gzip

# Check compression ratio
prism namespace stats user-events --compression

# Output:
# Original size: 10GB
# Compressed size: 2GB
# Compression ratio: 5:1
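
The ratio in the output is original size divided by compressed size. Expressing the same numbers as a percentage saved can make them easier to compare across namespaces. `compression_summary` below is a hypothetical helper for that arithmetic.

```python
from typing import Tuple


def compression_summary(original_gb: float, compressed_gb: float) -> Tuple[float, float]:
    """Return (ratio, percent_saved) for a pair of sizes in the same unit."""
    if original_gb <= 0 or compressed_gb <= 0:
        raise ValueError("sizes must be positive")
    ratio = original_gb / compressed_gb
    percent_saved = 100 * (original_gb - compressed_gb) / original_gb
    return ratio, percent_saved


# Sizes from the sample output above.
print(compression_summary(10, 2))  # (5.0, 80.0)
```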

Monitoring Dashboards

View Dashboard

# Open Grafana dashboard
prism namespace dashboard user-events

# Opens browser to:
# http://grafana.local/d/prism-user-events

Dashboard Panels

Default dashboard includes:

Throughput: Writes and reads per second
Latency: P50, P95, P99 percentiles
Errors: Error rate and error types
Backend health: Backend status and connectivity
Consumer lag: Per-consumer lag metrics
Resource usage: CPU, memory, storage

Best Practices

Regular Checks

Daily:

Check namespace status
Review error logs
Monitor capacity utilization

Weekly:

Review capacity trends
Check consumer lag
Validate alert configuration

Monthly:

Review retention policies
Audit access permissions
Test backup/restore procedures

Capacity Planning

Monitor trends: Use 30-day rolling average for capacity planning
Scale proactively: Scale at 70% utilization (don't wait for 100%)
Test scaling: Validate scaling in staging before production
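
The first two points can be combined into a simple check: average the last 30 daily peaks and compare against 70% of provisioned capacity. This is a minimal sketch, assuming you already export one peak RPS value per day; `rolling_average` and `should_scale` are hypothetical helpers.

```python
from typing import Sequence


def rolling_average(daily_peak_rps: Sequence[float], window_days: int = 30) -> float:
    """Mean of the most recent window_days values."""
    if not daily_peak_rps:
        raise ValueError("daily_peak_rps must not be empty")
    if window_days < 1:
        raise ValueError("window_days must be at least 1")
    recent = list(daily_peak_rps)[-window_days:]
    return sum(recent) / len(recent)


def should_scale(daily_peak_rps: Sequence[float], capacity_rps: float,
                 threshold_pct: float = 70, window_days: int = 30) -> bool:
    """True when the rolling average reaches threshold_pct of capacity."""
    average = rolling_average(daily_peak_rps, window_days)
    return average * 100 >= threshold_pct * capacity_rps


# Hypothetical history: 30 days of peaks at 4000 RPS against 5000 RPS capacity.
print(should_scale([4000] * 30, capacity_rps=5000))  # True (80% utilization)
print(should_scale([2000] * 30, capacity_rps=5000))  # False (40% utilization)
```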

Incident Response

When alerts fire:

Acknowledge alert to prevent duplicate notifications
Check namespace status for quick diagnosis
Review recent changes (deployments, config updates)
Check backend health (external dependencies)
Scale resources if the issue is capacity-related
Engage the platform team if the backend is at fault

Getting Help

Support Channels

# Get help with command
prism namespace --help

# Check system status
prism status

# Contact platform team
# (internal support channel)

Useful Commands

# Summary of all namespaces
prism namespace list --status

# Detailed namespace info
prism namespace show user-events

# Recent activity
prism namespace activity user-events --last 24h

# Export metrics for analysis
prism namespace metrics user-events \
  --start 2025-10-01 \
  --end 2025-10-19 \
  --format csv \
  --output metrics.csv

Key Takeaways

Monitor proactively: Check status daily, review trends weekly
Scale before limits: Act at 70-80% capacity, not 100%
Test operations: Validate backup/restore, scaling in staging
Document incidents: Track issues and resolutions
Automate alerts: Set up alerting for critical metrics

Next Steps

Getting Started - Review basics
Core Concepts - Understand architecture
Architecture Guide - Deep technical dive
