Operations

Read time: 10 minutes

Daily Operations

Check Namespace Health

# View overall status
prism namespace status user-events

# Output:
# Namespace: user-events
# Status: ✓ healthy
# Uptime: 45 days
# Last restart: 2025-09-04 10:30:00

Monitor Performance

# Real-time metrics
prism namespace metrics user-events --watch

# Output (updates every 5 seconds):
# Latency: P50=0.8ms P99=3.2ms
# Throughput: 850 writes/sec, 4200 reads/sec
# Errors: <0.01% (2 errors in last minute)
# Backend: healthy

View Recent Errors

# Show errors from last hour
prism namespace errors user-events --last 1h

# Output:
# 2025-10-19 10:30:00 TimeoutError: Backend timeout after 5s
# 2025-10-19 10:31:15 ValidationError: Message exceeds size limit

Capacity Management

Scale Up Capacity

When to scale:

  • Write/read RPS approaching 80% of capacity
  • Latency P99 increasing
  • Backend resource utilization high

How to scale:

# Check current capacity utilization
prism namespace status user-events

# Output:
# Write RPS: 4200 / 5000 (84% utilization) ← Near limit!
# Read RPS: 8500 / 10000 (85% utilization)

# Increase capacity
prism namespace update user-events \
--write-rps 10000 \
--read-rps 20000

# Prism automatically:
# - Increases backend partitions/shards
# - Adjusts connection pools
# - Updates resource allocation
# - Applies the change with zero downtime
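
The 80% rule of thumb above can be checked mechanically. The following is a minimal sketch, not part of the Prism CLI: the function names and the default threshold are assumptions chosen to mirror the sample status output.

```python
def utilization(current_rps: float, capacity_rps: float) -> float:
    """Return utilization as a fraction of provisioned capacity."""
    if capacity_rps <= 0:
        raise ValueError("capacity_rps must be positive")
    return current_rps / capacity_rps


def should_scale_up(current_rps: float, capacity_rps: float,
                    threshold: float = 0.80) -> bool:
    """True when utilization has reached the scale-up threshold (assumed 80%)."""
    return utilization(current_rps, capacity_rps) >= threshold


# Figures from the sample status output above.
print(f"write: {utilization(4200, 5000):.0%}")   # write: 84%
print(f"read:  {utilization(8500, 10000):.0%}")  # read:  85%
print(should_scale_up(4200, 5000))               # True
```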

Scale Down Capacity

When to scale down:

  • Consistently using <50% of capacity
  • Cost reduction is a priority
  • Seasonal traffic decrease

# Scale down gradually
prism namespace update user-events \
--write-rps 2000 \
--read-rps 5000

# Monitor for 24 hours before further reduction

Best practice: Scale down in steps (don't go from 10K to 1K immediately).
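
One way to plan a stepped reduction is to cap how much capacity each step removes. The sketch below is illustrative only: the helper name and the 50% per-step cap are assumptions, not Prism behavior. Each value it returns would be applied with a separate update, with a monitoring period in between.

```python
def scale_down_steps(current: int, target: int, max_cut: float = 0.5) -> list[int]:
    """Return the capacity values to apply, in order, from current down to target.

    Each step removes at most `max_cut` (a fraction) of the previous value.
    The 50% default is an illustrative assumption, not a Prism rule.
    """
    if not 0 < max_cut < 1:
        raise ValueError("max_cut must be between 0 and 1")
    if target <= 0 or target > current:
        raise ValueError("target must be positive and no larger than current")
    steps = []
    value = current
    while value > target:
        # Always shrink by at least 1 so the loop terminates.
        reduced = min(value - 1, int(value * (1 - max_cut)))
        value = max(target, reduced)
        steps.append(value)
    return steps


print(scale_down_steps(10_000, 1_000))  # [5000, 2500, 1250, 1000]
```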

Troubleshooting

High Latency

Symptom: Operations taking longer than expected

Diagnosis:

# Check latency percentiles
prism namespace metrics user-events

# Look for:
# P99 > 50ms: Backend overloaded or network issues
# P50 and P99 both within target: Normal operation
# Sudden spike: Check for errors
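
For readers less familiar with percentiles, here is a small sketch of how P50 and P99 can be computed from raw latency samples using the nearest-rank method. How Prism itself computes these values is not documented here, so treat the method and the sample data as assumptions.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# 100 hypothetical latency samples in milliseconds: 98 fast requests, 2 slow ones.
latencies_ms = [1.0] * 98 + [80.0, 120.0]
print(percentile(latencies_ms, 50))  # 1.0
print(percentile(latencies_ms, 99))  # 80.0
```

Because P99 is never lower than P50, a small number of slow requests shows up in P99 long before it moves the median.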

Common causes:

  1. Backend overload:

    # Check capacity utilization
    prism namespace status user-events
    # If >80%, scale up
  2. Network issues:

    # Check backend connectivity
    prism backend health kafka-cluster-1
  3. Large messages:

    # Check message sizes
    prism namespace stats user-events --message-sizes
    # Consider enabling the Claim Check pattern

Data Not Appearing

Symptom: Published data not visible to consumers

Diagnosis:

# Check recent writes
prism namespace stats user-events --last 5m

# Output:
# Writes: 1234 (all acknowledged)
# Reads: 0 ← No consumers!

# Check consumer status
prism namespace consumers user-events

# Output:
# Consumer: analytics-api
# Status: disconnected ← Problem found!
# Last seen: 2025-10-19 09:00:00 (90 minutes ago)

Common causes:

  1. Consumer not running:

    # Verify consumer service is running
    kubectl get pods -l app=analytics-api
  2. Consumer lag:

    # Check consumer lag
    prism namespace consumers user-events --lag

    # Output:
    # Consumer: analytics-api
    # Lag: 50000 messages ← Falling behind!
    # Processing rate: 100/sec
    # Time to catch up: ~8 minutes
  3. Access permissions:

    # Verify consumer has access
    prism namespace access user-events

    # If missing, grant access:
    prism namespace grant user-events \
    --service analytics-api \
    --permissions read
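
The "time to catch up" figure in the lag output is simple arithmetic: lag divided by the net drain rate. A minimal sketch follows; the function name is hypothetical, and the sample output appears to assume that no new messages arrive while the consumer catches up.

```python
def catch_up_seconds(lag_messages: int, processing_rate: float,
                     incoming_rate: float = 0.0) -> float:
    """Estimate seconds until lag reaches zero.

    Returns float('inf') when the consumer is not outpacing producers.
    """
    net_rate = processing_rate - incoming_rate
    if net_rate <= 0:
        return float("inf")
    return lag_messages / net_rate


# Figures from the sample lag output above, with no incoming traffic.
print(catch_up_seconds(50_000, 100) / 60)  # about 8.3 minutes

# With producers still writing at 60 msg/sec, the same lag takes much longer.
print(catch_up_seconds(50_000, 100, incoming_rate=60) / 60)  # about 20.8 minutes
```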

Backend Errors

Symptom: Errors in logs or metrics

Diagnosis:

# View error details
prism namespace errors user-events --last 1h --verbose

# Output:
# 2025-10-19 10:30:00 ERROR
# Type: BackendTimeoutError
# Message: Kafka broker timeout after 5s
# Backend: kafka-cluster-1
# Retry: 3/3 (failed)

Common causes:

  1. Backend unavailable:

    # Check backend health
    prism backend health kafka-cluster-1

    # Output:
    # Status: degraded
    # Broker 1: healthy
    # Broker 2: unhealthy ← Problem!
    # Broker 3: healthy
  2. Network partition:

    # Check connectivity (tests from the Prism proxy to the backend)
    prism backend connectivity kafka-cluster-1
  3. Backend overload:

    # Check backend metrics
    prism backend metrics kafka-cluster-1

    # Look for high CPU, memory, disk I/O

Data Management

Adjust Retention

# Increase retention (keeps more data)
prism namespace update user-events --retention 30days

# Decrease retention (saves storage)
prism namespace update user-events --retention 7days

# Prism automatically:
# - Updates backend retention policies
# - Schedules data cleanup
# - Adjusts storage allocation

Manual Data Cleanup

# Delete data older than specific date
prism namespace cleanup user-events \
--before 2025-09-01

# Delete all data (careful!)
prism namespace purge user-events --confirm

# Prism will ask for confirmation:
# WARNING: This will delete ALL data in namespace user-events
# Type namespace name to confirm: user-events

Export Data

# Export to file
prism namespace export user-events \
--start 2025-10-01 \
--end 2025-10-19 \
--format jsonl \
--output user-events-export.jsonl

# Export to S3
prism namespace export user-events \
--start 2025-10-01 \
--end 2025-10-19 \
--format parquet \
--output s3://backups/user-events/

Access Management

Grant Access

# Grant read-write access
prism namespace grant user-events \
--service new-consumer \
--permissions read,write

# Grant read-only access
prism namespace grant user-events \
--service reporting-dashboard \
--permissions read

# Grant admin access
prism namespace grant user-events \
--team platform-team \
--permissions admin

Revoke Access

# Revoke all access
prism namespace revoke user-events \
--service old-consumer

# Revoke specific permissions
prism namespace revoke user-events \
--service analytics-api \
--permissions write

View Access

# List all access grants
prism namespace access user-events

# Output:
# Owners:
# - analytics-team (admin)
#
# Consumers:
# - analytics-api (read, write)
# - reporting-dashboard (read)
# - data-pipeline (write)
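
The grant and revoke commands behave like set operations on a per-service permission list. The sketch below is a hypothetical in-memory model meant only to illustrate the difference between revoking specific permissions and revoking all access; it does not reflect how Prism stores grants.

```python
from collections import defaultdict
from typing import Optional

# service name -> set of granted permissions (hypothetical in-memory model)
grants: dict[str, set[str]] = defaultdict(set)


def grant(service: str, permissions: set[str]) -> None:
    """Add permissions to whatever the service already holds."""
    grants[service] |= permissions


def revoke(service: str, permissions: Optional[set[str]] = None) -> None:
    """Revoke the listed permissions, or all access when none are listed."""
    if permissions is None:
        grants.pop(service, None)
        return
    grants[service] -= permissions
    if not grants[service]:
        del grants[service]  # nothing left, drop the entry


grant("analytics-api", {"read", "write"})
revoke("analytics-api", {"write"})
print(sorted(grants["analytics-api"]))  # ['read']

grant("old-consumer", {"read"})
revoke("old-consumer")
print("old-consumer" in grants)  # False
```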

Alerting

Configure Alerts

# alerts.yaml
namespace: user-events
alerts:
  - name: high-latency
    condition: p99_latency > 50ms
    duration: 5m
    severity: warning

  - name: high-error-rate
    condition: error_rate > 1%
    duration: 1m
    severity: critical

  - name: capacity-limit
    condition: write_rps > 0.9 * capacity
    duration: 10m
    severity: warning

  - name: consumer-lag
    condition: consumer_lag > 10000
    duration: 5m
    severity: warning

# Apply alerts
prism namespace alerts create user-events alerts.yaml

# View active alerts
prism namespace alerts list user-events

# Silence alert temporarily
prism namespace alerts silence user-events \
--alert high-latency \
--duration 1h
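
Each alert pairs a condition with a duration, which suggests the condition must hold continuously before the alert fires. The sketch below shows one plausible reading of that rule; Prism's actual evaluation semantics are not documented here, so the function and its behavior are assumptions.

```python
def should_fire(samples: list[tuple[float, float]], threshold: float,
                duration_s: float) -> bool:
    """Decide whether a threshold alert should fire.

    `samples` is a list of (timestamp_seconds, value) pairs. The alert fires
    when every sample since the most recent breach start exceeds the
    threshold and that streak spans at least `duration_s` seconds.
    """
    if not samples:
        return False
    ordered = sorted(samples)
    breach_start = None
    for timestamp, value in ordered:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp
        else:
            breach_start = None  # streak broken, start over
    if breach_start is None:
        return False
    latest = ordered[-1][0]
    return latest - breach_start >= duration_s


# Hypothetical P99 latency samples (ms), one per minute.
p99 = [(0, 40), (60, 55), (120, 60), (180, 58), (240, 61), (300, 57), (360, 66)]
print(should_fire(p99, threshold=50, duration_s=300))  # True

# A single dip below the threshold resets the streak.
dipped = [(0, 55), (60, 45), (120, 60)]
print(should_fire(dipped, threshold=50, duration_s=300))  # False
```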

Backup and Recovery

Create Backup

# Full namespace backup
prism namespace backup user-events \
--output s3://backups/user-events/2025-10-19/

# Incremental backup (only new data)
prism namespace backup user-events \
--incremental \
--since 2025-10-18 \
--output s3://backups/user-events/incremental/

Restore from Backup

# Restore entire namespace
prism namespace restore user-events \
--source s3://backups/user-events/2025-10-19/ \
--confirm

# Restore specific time range
prism namespace restore user-events \
--source s3://backups/user-events/2025-10-19/ \
--start 2025-10-19T10:00:00Z \
--end 2025-10-19T12:00:00Z

Point-in-Time Recovery

# Restore to specific timestamp
prism namespace restore user-events \
--timestamp 2025-10-19T11:30:00Z

# Prism restores data to exactly that point

Migration Operations

Backend Migration

Scenario: Migrate from NATS to Kafka (zero downtime)

# Step 1: Enable shadow traffic
prism namespace migrate user-events \
--target-backend kafka \
--mode shadow

# Prism now:
# - Writes to NATS (primary)
# - Writes to Kafka (shadow, not used for reads)
# - Compares results

# Step 2: Monitor shadow traffic (24-48 hours)
prism namespace migrate-status user-events

# Output:
# Migration: NATS → Kafka
# Mode: shadow
# Duration: 36 hours
# Write success rate: 99.99% (both backends)
# Comparison: 100% match

# Step 3: Switch reads to Kafka
prism namespace migrate user-events \
--target-backend kafka \
--mode read-from-new

# Prism now:
# - Writes to both NATS and Kafka
# - Reads from Kafka only

# Step 4: Monitor (24 hours)

# Step 5: Complete migration
prism namespace migrate user-events \
--target-backend kafka \
--mode complete

# Prism now:
# - Writes to Kafka only
# - Reads from Kafka only
# - NATS deprovisioned
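
The three migration modes differ only in which backends receive writes and which one serves reads. The table below restates the steps above as a small lookup; the `Routing` type and `routing_for` function are hypothetical names used for illustration, not Prism internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Routing:
    writes: tuple[str, ...]  # backends that receive every write
    reads: str               # backend that serves reads


def routing_for(mode: str, old: str = "nats", new: str = "kafka") -> Routing:
    """Return the read/write routing described for each migration mode."""
    table = {
        "shadow": Routing(writes=(old, new), reads=old),
        "read-from-new": Routing(writes=(old, new), reads=new),
        "complete": Routing(writes=(new,), reads=new),
    }
    if mode not in table:
        raise ValueError(f"unknown migration mode: {mode}")
    return table[mode]


for mode in ("shadow", "read-from-new", "complete"):
    print(mode, routing_for(mode))
```

Both backends keep receiving writes until the final step, which is what makes a rollback possible before `complete`.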

Rollback Migration

# If issues are detected, roll back immediately
prism namespace migrate user-events --rollback

# Prism reverts to previous backend (NATS)

Performance Optimization

Enable Caching

# Enable read cache
prism namespace update user-events \
--cache-enabled \
--cache-ttl 60s

# Check cache hit rate
prism namespace metrics user-events --cache

# Output:
# Cache hit rate: 85%
# Cache size: 1.2GB / 2GB
# Evictions: 1000/sec
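
To make the TTL and hit-rate figures concrete, here is a minimal sketch of a read-through cache with time-based expiry. It is not Prism's cache: the class is hypothetical, it has no size limit (so it never evicts), and the clock is injectable only to make the example deterministic.

```python
import time
from typing import Any, Callable


class TTLCache:
    """Read-through cache whose entries expire after `ttl_s` seconds."""

    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self._store: dict[str, tuple[float, Any]] = {}  # key -> (stored_at, value)
        self.hits = 0
        self.misses = 0

    def get(self, key: str, load: Callable[[str], Any]) -> Any:
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.ttl_s:
            self.hits += 1
            return entry[1]
        # Missing or expired: fetch from the backend and remember the result.
        self.misses += 1
        value = load(key)
        self._store[key] = (now, value)
        return value

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


fake_now = [0.0]
cache = TTLCache(ttl_s=60, clock=lambda: fake_now[0])
cache.get("user:1", str.upper)   # miss, loads from the "backend"
cache.get("user:1", str.upper)   # hit, served from cache
fake_now[0] = 61.0
cache.get("user:1", str.upper)   # miss, entry expired after 60s
print(f"{cache.hit_rate():.0%}")  # 33%
```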

Tune Connection Pools

# Increase connection pool (high concurrency)
prism namespace update user-events \
--connection-pool-size 100

# Decrease connection pool (low concurrency)
prism namespace update user-events \
--connection-pool-size 20

Compression

# Enable compression (reduce bandwidth)
prism namespace update user-events \
--compression gzip

# Check compression ratio
prism namespace stats user-events --compression

# Output:
# Original size: 10GB
# Compressed size: 2GB
# Compression ratio: 5:1
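
The ratio in the sample output is original size divided by compressed size (10GB / 2GB = 5:1). To estimate what gzip might achieve on your own payloads before enabling it, a quick local check with Python's standard `gzip` module can help. The event shape below is made up for illustration, and real ratios depend heavily on the data.

```python
import gzip
import json


def compression_ratio(payload: bytes) -> float:
    """Return original size divided by gzip-compressed size."""
    compressed = gzip.compress(payload)
    return len(payload) / len(compressed)


# Hypothetical, highly repetitive JSONL events; real data will compress differently.
events = [{"user_id": i % 50, "event": "page_view", "path": "/home"}
          for i in range(1000)]
payload = "\n".join(json.dumps(e) for e in events).encode("utf-8")

print(f"{len(payload)} bytes, ratio {compression_ratio(payload):.1f}:1")
```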

Monitoring Dashboards

View Dashboard

# Open Grafana dashboard
prism namespace dashboard user-events

# Opens browser to:
# http://grafana.local/d/prism-user-events

Dashboard Panels

Default dashboard includes:

  1. Throughput: Writes and reads per second
  2. Latency: P50, P95, P99 percentiles
  3. Errors: Error rate and error types
  4. Backend health: Backend status and connectivity
  5. Consumer lag: Per-consumer lag metrics
  6. Resource usage: CPU, memory, storage

Best Practices

Regular Checks

Daily:

  • Check namespace status
  • Review error logs
  • Monitor capacity utilization

Weekly:

  • Review capacity trends
  • Check consumer lag
  • Validate alert configuration

Monthly:

  • Review retention policies
  • Audit access permissions
  • Test backup/restore procedures

Capacity Planning

  1. Monitor trends: Use 30-day rolling average for capacity planning
  2. Scale proactively: Scale at 70% utilization (don't wait for 100%)
  3. Test scaling: Validate scaling in staging before production
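
The first two points can be combined into a simple check: smooth daily peak traffic with a 30-day rolling average, then compare it against 70% of provisioned capacity. The sketch below uses made-up traffic numbers and hypothetical function names; it illustrates the arithmetic and is not a Prism feature.

```python
from collections import deque


def rolling_average(values: list[float], window: int = 30) -> list[float]:
    """Average of the trailing `window` values at each position.

    Early positions average over however many values are available so far.
    """
    averages = []
    recent: deque = deque(maxlen=window)
    total = 0.0
    for value in values:
        if len(recent) == window:
            total -= recent[0]  # this value is about to fall out of the window
        recent.append(value)
        total += value
        averages.append(total / len(recent))
    return averages


def needs_scaling(daily_peak_rps: list[float], capacity_rps: float,
                  threshold: float = 0.70) -> bool:
    """True when the latest rolling average reaches the threshold fraction of capacity."""
    latest_average = rolling_average(daily_peak_rps)[-1]
    return latest_average / capacity_rps >= threshold


# Hypothetical 60 days of steadily growing peak write traffic.
daily_peak_rps = [3000 + 20 * day for day in range(60)]
print(rolling_average(daily_peak_rps)[-1])                # 3890.0
print(needs_scaling(daily_peak_rps, capacity_rps=5000))   # True (about 78%)
```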

Incident Response

When alerts fire:

  1. Acknowledge alert to prevent duplicate notifications
  2. Check namespace status for quick diagnosis
  3. Review recent changes (deployments, config updates)
  4. Check backend health (external dependencies)
  5. Scale resources if the issue is capacity-related
  6. Engage the platform team if the issue is in the backend

Getting Help

Support Channels

# Get help with a command
prism namespace --help

# Check system status
prism status

# Contact platform team
# (internal support channel)

Useful Commands

# Summary of all namespaces
prism namespace list --status

# Detailed namespace info
prism namespace show user-events

# Recent activity
prism namespace activity user-events --last 24h

# Export metrics for analysis
prism namespace metrics user-events \
--start 2025-10-01 \
--end 2025-10-19 \
--format csv \
--output metrics.csv

Key Takeaways

  1. Monitor proactively: Check status daily, review trends weekly
  2. Scale before limits: Act at 70-80% capacity, not 100%
  3. Test operations: Validate backup/restore and scaling in staging
  4. Document incidents: Track issues and resolutions
  5. Automate alerts: Set up alerting for critical metrics

Next Steps