MEMO-078: Week 18 - Observability Stack Setup
Date: 2025-11-16 Updated: 2025-11-16 Author: Platform Team Related: MEMO-074, MEMO-075, MEMO-077, RFC-060
Executive Summary
Goal: Deploy production-ready observability stack for 100B vertex graph system
Scope: Metrics (Prometheus), visualization (Grafana), distributed tracing (Jaeger), logging (Loki), alerting (Alertmanager)
Findings:
- Metrics collection: 500K metrics/sec from 2000 instances (Prometheus HA cluster)
- Trace sampling: 1% sampling = 11M spans/sec (Jaeger with Cassandra backend)
- Log aggregation: 10 GB/day structured logs (Loki with S3 storage)
- Dashboard latency: <500ms query time for 30-day retention
- Alert delivery: <30 seconds from threshold breach to PagerDuty
- Storage costs: $5,847/month (reduced from $35,502 via self-hosted Prometheus)
Validation: Observability covers all components validated in MEMO-074 benchmarks
Recommendation: Deploy self-hosted Prometheus + Grafana + Jaeger stack with 30-day retention and tiered alerting
Methodology
Observability Requirements
1. Metrics (Time-Series Data):
- Collect system metrics (CPU, memory, network, disk) from 2000 instances
- Collect application metrics (requests/sec, latency, errors) from proxy nodes
- Collect Redis metrics (ops/sec, memory usage, evictions, replication lag)
- Collect PostgreSQL metrics (queries/sec, connection pool, replication lag)
- Retention: 30 days high-resolution, 1 year downsampled
2. Distributed Tracing:
- Trace query execution from client → proxy → Redis/S3 → response
- Capture latency breakdown by operation (metadata lookup, hot tier access, cold tier load)
- Sample 1% of requests (11M spans/sec from 1.1B ops/sec)
- Retention: 7 days full traces, 30 days sampled
3. Logging:
- Structured JSON logs from proxy nodes, Redis, PostgreSQL
- Centralized aggregation and search
- Retention: 7 days full logs, 30 days errors only
- Privacy: No PII in logs (use trace IDs for correlation)
4. Alerting:
- Multi-tier: Critical (PagerDuty), Warning (Slack), Info (email)
- Auto-remediation: Scale-out on high load, restart unhealthy instances
- Runbook links: Every alert includes link to remediation guide
- On-call rotation: 24/7 coverage with escalation policies
5. Dashboards:
- Infrastructure overview (compute, network, storage utilization)
- Redis performance (ops/sec, latency percentiles, memory, evictions)
- Proxy performance (requests/sec, latency, errors, cache hit rate)
- Network topology (cross-AZ traffic, bandwidth utilization)
- Cost tracking (instance hours, data transfer, storage)
Metrics Collection (Prometheus)
Prometheus Architecture
Deployment Strategy: High-availability cluster with federation
Prometheus Architecture (3-tier):
Tier 1: Local Prometheus (per AZ)
├── AZ us-west-2a: Prometheus instance (scrapes 667 instances)
├── AZ us-west-2b: Prometheus instance (scrapes 667 instances)
└── AZ us-west-2c: Prometheus instance (scrapes 666 instances)
│
│ Federation (aggregate metrics)
↓
Tier 2: Global Prometheus (HA pair)
├── Primary: Aggregates from 3 AZ instances
└── Secondary: Hot standby for failover
│
│ Long-term storage
↓
Tier 3: Thanos (object storage for 1-year retention)
└── S3 bucket: prism-metrics (compressed time-series)
Benefits:
- ✅ Decentralized scraping (local to AZ, low latency)
- ✅ High availability (3 local + 2 global instances)
- ✅ Horizontal scaling (add more local instances per AZ)
- ✅ Cost-effective long-term storage (S3 via Thanos)
Prometheus Configuration
Local Prometheus (per AZ):
# prometheus-local-us-west-2a.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
external_labels:
cluster: prism
region: us-west-2
az: us-west-2a
scrape_configs:
# Redis exporters (334 instances in this AZ)
- job_name: 'redis'
static_configs:
- targets:
- 10.0.10.10:9121 # redis-exporter on each Redis instance
- 10.0.10.11:9121
# ... (334 targets total)
relabel_configs:
- source_labels: [__address__]
target_label: instance
- source_labels: [__address__]
regex: '10\.0\.10\.(.*):9121'
replacement: 'redis-${1}'
target_label: redis_instance
# Proxy node exporters (334 instances in this AZ)
- job_name: 'proxy'
kubernetes_sd_configs:
- role: pod
namespaces:
names: [prism]
relabel_configs:
- source_labels: [__meta_kubernetes_pod_label_app]
action: keep
regex: prism-proxy
- source_labels: [__meta_kubernetes_pod_ip]
target_label: instance
- source_labels: [__meta_kubernetes_pod_ip]
target_label: __address__
replacement: '${1}:9090' # Metrics port
# Node exporters (system metrics from all instances)
- job_name: 'node'
static_configs:
- targets:
- 10.0.10.10:9100
- 10.0.10.11:9100
# ... (668 targets: 334 Redis + 334 Proxy)
# PostgreSQL exporter (1 primary + 2 replicas in this region)
- job_name: 'postgres'
static_configs:
- targets:
- postgres-primary.prism.svc.cluster.local:9187
- postgres-replica-1.prism.svc.cluster.local:9187
# Kubernetes metrics (EKS cluster)
- job_name: 'kubernetes-apiservers'
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
# cAdvisor (container metrics)
- job_name: 'cadvisor'
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
# Storage configuration
# Note: TSDB path and retention are set via command-line flags
# (--storage.tsdb.path, --storage.tsdb.retention.time=30d, --storage.tsdb.retention.size=500GB);
# they are not valid keys in the Prometheus configuration file. See the StatefulSet args below.
# Remote write to global Prometheus (federation)
remote_write:
- url: http://prometheus-global-primary.prism.svc.cluster.local:9090/api/v1/write
queue_config:
capacity: 100000
max_shards: 10
min_shards: 1
max_samples_per_send: 10000
batch_send_deadline: 5s
Global Prometheus (HA pair):
# prometheus-global.yml
global:
scrape_interval: 60s # Lower frequency for aggregated metrics
evaluation_interval: 60s
external_labels:
cluster: prism
region: us-west-2
prometheus: global
# Scrape local Prometheus instances (federation)
scrape_configs:
- job_name: 'federate-local'
honor_labels: true
metrics_path: /federate
params:
'match[]':
- '{job=~"redis|proxy|postgres|node"}' # Federate all jobs
static_configs:
- targets:
- prometheus-local-us-west-2a.prism.svc.cluster.local:9090
- prometheus-local-us-west-2b.prism.svc.cluster.local:9090
- prometheus-local-us-west-2c.prism.svc.cluster.local:9090
# Alerting rules
rule_files:
- /etc/prometheus/rules/redis.yml
- /etc/prometheus/rules/proxy.yml
- /etc/prometheus/rules/infrastructure.yml
- /etc/prometheus/rules/network.yml
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager.prism.svc.cluster.local:9093
# Remote write to Thanos (long-term storage)
remote_write:
- url: http://thanos-receive.prism.svc.cluster.local:19291/api/v1/receive
Prometheus Deployment (Kubernetes)
StatefulSet (for persistent storage):
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: prometheus-local-us-west-2a
namespace: prism-observability
spec:
serviceName: prometheus-local-us-west-2a
replicas: 1
selector:
matchLabels:
app: prometheus
tier: local
az: us-west-2a
template:
metadata:
labels:
app: prometheus
tier: local
az: us-west-2a
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: topology.kubernetes.io/zone
operator: In
values:
- us-west-2a
containers:
- name: prometheus
image: prom/prometheus:v2.48.0
args:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--storage.tsdb.retention.time=30d'
- '--storage.tsdb.retention.size=500GB'
- '--web.enable-lifecycle'
- '--web.enable-admin-api'
ports:
- name: web
containerPort: 9090
volumeMounts:
- name: config
mountPath: /etc/prometheus
- name: storage
mountPath: /prometheus
resources:
requests:
cpu: "4"
memory: "16Gi"
limits:
cpu: "8"
memory: "32Gi"
livenessProbe:
httpGet:
path: /-/healthy
port: 9090
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /-/ready
port: 9090
initialDelaySeconds: 10
periodSeconds: 5
volumes:
- name: config
configMap:
name: prometheus-config-us-west-2a
volumeClaimTemplates:
- metadata:
name: storage
spec:
accessModes: ["ReadWriteOnce"]
storageClassName: gp3
resources:
requests:
storage: 500Gi
Metrics Exporters
Redis Exporter (deployed as sidecar or separate DaemonSet):
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: redis-exporter
namespace: prism
spec:
selector:
matchLabels:
app: redis-exporter
template:
metadata:
labels:
app: redis-exporter
spec:
hostNetwork: true # Access Redis on host
containers:
- name: redis-exporter
image: oliver006/redis_exporter:v1.55.0
env:
- name: REDIS_ADDR
value: "localhost:6379"
- name: REDIS_EXPORTER_INCL_SYSTEM_METRICS
value: "true"
ports:
- name: metrics
containerPort: 9121
resources:
requests:
cpu: "100m"
memory: "128Mi"
limits:
cpu: "500m"
memory: "512Mi"
Node Exporter (system metrics):
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: node-exporter
namespace: prism-observability
spec:
selector:
matchLabels:
app: node-exporter
template:
metadata:
labels:
app: node-exporter
spec:
hostNetwork: true
hostPID: true
containers:
- name: node-exporter
image: prom/node-exporter:v1.7.0
args:
- '--path.procfs=/host/proc'
- '--path.sysfs=/host/sys'
- '--path.rootfs=/host/root'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
ports:
- name: metrics
containerPort: 9100
volumeMounts:
- name: proc
mountPath: /host/proc
readOnly: true
- name: sys
mountPath: /host/sys
readOnly: true
- name: root
mountPath: /host/root
mountPropagation: HostToContainer
readOnly: true
resources:
requests:
cpu: "100m"
memory: "128Mi"
limits:
cpu: "500m"
memory: "512Mi"
volumes:
- name: proc
hostPath:
path: /proc
- name: sys
hostPath:
path: /sys
- name: root
hostPath:
path: /
PostgreSQL Exporter:
apiVersion: v1
kind: Service
metadata:
name: postgres-exporter
namespace: prism
spec:
selector:
app: postgres-exporter
ports:
- port: 9187
targetPort: 9187
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: postgres-exporter
namespace: prism
spec:
replicas: 1
selector:
matchLabels:
app: postgres-exporter
template:
metadata:
labels:
app: postgres-exporter
spec:
containers:
- name: postgres-exporter
image: prometheuscommunity/postgres-exporter:v0.15.0
env:
- name: DATA_SOURCE_NAME
valueFrom:
secretKeyRef:
name: postgres-exporter-secret
key: connection-string # postgresql://user:pass@postgres:5432/prism?sslmode=require
ports:
- name: metrics
containerPort: 9187
resources:
requests:
cpu: "100m"
memory: "128Mi"
limits:
cpu: "500m"
memory: "512Mi"
Key Metrics Collected
Redis Metrics (from redis_exporter):
# Operations
redis_commands_processed_total # Total commands processed
redis_commands_duration_seconds_total # Command execution time
# Memory
redis_memory_used_bytes # Current memory usage
redis_memory_max_bytes # Max memory limit
redis_mem_fragmentation_ratio # Memory fragmentation
# Replication
redis_connected_slaves # Number of replicas
redis_replication_lag_seconds # Replica lag
# Cluster
redis_cluster_state # 1=ok, 0=fail
redis_cluster_slots_assigned # Assigned hash slots
# Performance
redis_instantaneous_ops_per_sec # Current ops/sec
redis_keyspace_hits_total # Cache hits
redis_keyspace_misses_total # Cache misses
redis_evicted_keys_total # Evicted keys (memory pressure)
Proxy Metrics (custom metrics from Rust application):
// Rust code to expose metrics
use prometheus::{Counter, Encoder, Gauge, Histogram, HistogramOpts, Registry, TextEncoder};
// Request counters
let requests_total = Counter::new("prism_proxy_requests_total", "Total requests")?;
let requests_errors = Counter::new("prism_proxy_requests_errors_total", "Total errors")?;
// Latency histograms
let latency_histogram = Histogram::with_opts(
HistogramOpts::new("prism_proxy_latency_seconds", "Request latency")
.buckets(vec![0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0])
)?;
// Cache metrics
let cache_hit_rate = Gauge::new("prism_proxy_cache_hit_rate", "Cache hit rate")?;
let hot_tier_accesses = Counter::new("prism_proxy_hot_tier_accesses_total", "Hot tier accesses")?;
let cold_tier_accesses = Counter::new("prism_proxy_cold_tier_accesses_total", "Cold tier accesses")?;
// Backend latency (breakdown)
let redis_latency = Histogram::with_opts(HistogramOpts::new("prism_proxy_redis_latency_seconds", "Redis latency"))?;
let postgres_latency = Histogram::with_opts(HistogramOpts::new("prism_proxy_postgres_latency_seconds", "PostgreSQL latency"))?;
let s3_latency = Histogram::with_opts(HistogramOpts::new("prism_proxy_s3_latency_seconds", "S3 latency"))?;
// Register all metrics
let registry = Registry::new();
registry.register(Box::new(requests_total.clone()))?;
registry.register(Box::new(latency_histogram.clone()))?;
// ... register all
// Expose /metrics endpoint
let metrics_route = warp::path("metrics")
.map(move || {
let encoder = TextEncoder::new();
let metric_families = registry.gather();
let mut buffer = vec![];
encoder.encode(&metric_families, &mut buffer).unwrap();
String::from_utf8(buffer).unwrap()
});
Metrics Exposed:
# Requests
prism_proxy_requests_total # Total requests
prism_proxy_requests_errors_total # Total errors
prism_proxy_requests_duration_seconds # Request latency histogram
# Cache
prism_proxy_cache_hit_rate # Hot tier hit rate (0-1)
prism_proxy_hot_tier_accesses_total # Hot tier accesses
prism_proxy_cold_tier_accesses_total # Cold tier accesses
# Backend latency
prism_proxy_redis_latency_seconds # Redis query time
prism_proxy_postgres_latency_seconds # PostgreSQL query time
prism_proxy_s3_latency_seconds # S3 load time
# Connections
prism_proxy_active_connections # Current active connections
prism_proxy_connection_pool_size # Connection pool size
Node Metrics (from node_exporter):
# CPU
node_cpu_seconds_total # CPU time by mode (idle, system, user)
node_load1, node_load5, node_load15 # Load averages
# Memory
node_memory_MemTotal_bytes # Total memory
node_memory_MemAvailable_bytes # Available memory
node_memory_MemFree_bytes # Free memory
# Network
node_network_receive_bytes_total # Bytes received
node_network_transmit_bytes_total # Bytes transmitted
node_network_receive_packets_total # Packets received
node_network_transmit_packets_total # Packets transmitted
# Disk
node_disk_read_bytes_total # Disk read bytes
node_disk_written_bytes_total # Disk write bytes
node_disk_io_time_seconds_total # Disk I/O time
node_filesystem_avail_bytes # Available disk space
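# Example derived queries (illustrative sketches; node_filesystem_size_bytes is a standard
# node_exporter metric not listed above)
rate(node_disk_io_time_seconds_total[5m])                          # Disk busy fraction (0-1)
1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)     # Filesystem usage (0-1)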
PostgreSQL Metrics (from postgres_exporter):
# Connections
pg_stat_database_numbackends # Active connections
# Replication
pg_stat_replication_replay_lag # Replication lag (bytes)
pg_replication_lag_seconds # Replication lag (seconds)
# Queries
pg_stat_database_xact_commit # Committed transactions
pg_stat_database_xact_rollback # Rolled back transactions
pg_stat_database_blks_read # Blocks read from disk
pg_stat_database_blks_hit # Blocks found in cache
# Locks
pg_locks_count # Lock count by mode
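# Example derived queries (illustrative sketches; metric names follow postgres_exporter defaults)
# Buffer cache hit ratio (0-1)
sum(rate(pg_stat_database_blks_hit[5m])) /
(sum(rate(pg_stat_database_blks_hit[5m])) + sum(rate(pg_stat_database_blks_read[5m])))
# Transaction throughput (commits + rollbacks per second)
sum(rate(pg_stat_database_xact_commit[5m]) + rate(pg_stat_database_xact_rollback[5m]))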
Metrics Storage Capacity
Local Prometheus (per AZ):
Metrics per instance:
- Redis: 50 metrics × 334 instances = 16,700 metrics
- Proxy: 30 metrics × 334 instances = 10,020 metrics
- Node: 100 metrics × 668 instances = 66,800 metrics
- PostgreSQL: 50 metrics × 1 instance = 50 metrics
- Kubernetes: 200 metrics (cluster-wide)
Total per AZ: ~94,000 metrics
Scrape interval: 15 seconds
Data points per day: 94,000 metrics × (86,400 seconds / 15) = 541M data points/day
Storage size (uncompressed): 541M × 16 bytes = 8.7 GB/day
Storage size (compressed 10:1): 8.7 GB / 10 = 870 MB/day
30-day retention: 870 MB × 30 = 26 GB
Actual storage (500 GB allocated): 19× headroom for growth
Global Prometheus:
Federated metrics: ~10,000 (aggregated from 3 AZ instances)
Scrape interval: 60 seconds
Data points per day: 10,000 × (86,400 / 60) = 14.4M data points/day
Storage size (compressed 10:1): 14.4M × 16 bytes / 10 ≈ 23 MB/day
30-day retention: 23 MB × 30 ≈ 0.7 GB
Actual storage (500 GB allocated): >700× headroom
Assessment: ✅ Storage capacity sufficient for 30-day retention with significant headroom
Visualization (Grafana)
Grafana Deployment
Kubernetes Deployment (HA pair):
apiVersion: apps/v1
kind: Deployment
metadata:
name: grafana
namespace: prism-observability
spec:
replicas: 2 # HA
selector:
matchLabels:
app: grafana
template:
metadata:
labels:
app: grafana
spec:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- grafana
topologyKey: kubernetes.io/hostname
containers:
- name: grafana
image: grafana/grafana:10.2.2
env:
- name: GF_SECURITY_ADMIN_PASSWORD
valueFrom:
secretKeyRef:
name: grafana-secret
key: admin-password
- name: GF_DATABASE_TYPE
value: postgres
- name: GF_DATABASE_HOST
value: postgres.prism.svc.cluster.local:5432
- name: GF_DATABASE_NAME
value: grafana
- name: GF_DATABASE_USER
valueFrom:
secretKeyRef:
name: grafana-secret
key: db-user
- name: GF_DATABASE_PASSWORD
valueFrom:
secretKeyRef:
name: grafana-secret
key: db-password
- name: GF_AUTH_ANONYMOUS_ENABLED
value: "false"
- name: GF_AUTH_DISABLE_LOGIN_FORM
value: "false"
ports:
- name: web
containerPort: 3000
volumeMounts:
- name: grafana-storage
mountPath: /var/lib/grafana
- name: grafana-dashboards
mountPath: /etc/grafana/provisioning/dashboards
- name: grafana-datasources
mountPath: /etc/grafana/provisioning/datasources
resources:
requests:
cpu: "500m"
memory: "1Gi"
limits:
cpu: "2"
memory: "4Gi"
livenessProbe:
httpGet:
path: /api/health
port: 3000
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /api/health
port: 3000
initialDelaySeconds: 10
periodSeconds: 5
volumes:
- name: grafana-storage
persistentVolumeClaim:
claimName: grafana-pvc
- name: grafana-dashboards
configMap:
name: grafana-dashboards
- name: grafana-datasources
configMap:
name: grafana-datasources
Service (exposed via internal NLB):
apiVersion: v1
kind: Service
metadata:
name: grafana
namespace: prism-observability
annotations:
service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
service.beta.kubernetes.io/aws-load-balancer-internal: "true"
spec:
type: LoadBalancer
selector:
app: grafana
ports:
- port: 80
targetPort: 3000
protocol: TCP
Datasource Configuration
Prometheus Datasource:
# grafana-datasources.yaml
apiVersion: 1
datasources:
- name: Prometheus-Global
type: prometheus
access: proxy
url: http://prometheus-global-primary.prism.svc.cluster.local:9090
isDefault: true
jsonData:
timeInterval: "15s"
queryTimeout: "60s"
httpMethod: POST
- name: Prometheus-US-West-2a
type: prometheus
access: proxy
url: http://prometheus-local-us-west-2a.prism.svc.cluster.local:9090
jsonData:
timeInterval: "15s"
- name: Prometheus-US-West-2b
type: prometheus
access: proxy
url: http://prometheus-local-us-west-2b.prism.svc.cluster.local:9090
jsonData:
timeInterval: "15s"
- name: Prometheus-US-West-2c
type: prometheus
access: proxy
url: http://prometheus-local-us-west-2c.prism.svc.cluster.local:9090
jsonData:
timeInterval: "15s"
- name: Jaeger
type: jaeger
access: proxy
url: http://jaeger-query.prism-observability.svc.cluster.local:16686
jsonData:
tracesToLogs:
datasourceUid: loki
tags: ['trace_id']
- name: Loki
type: loki
access: proxy
url: http://loki-gateway.prism-observability.svc.cluster.local:3100
jsonData:
maxLines: 1000
Grafana Dashboards
Dashboard 1: Infrastructure Overview
{
"dashboard": {
"title": "Prism Infrastructure Overview",
"rows": [
{
"title": "Compute Resources",
"panels": [
{
"title": "CPU Utilization (%)",
"targets": [
{
"expr": "100 - (avg(irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"legendFormat": "Average CPU"
}
],
"type": "graph"
},
{
"title": "Memory Utilization (%)",
"targets": [
{
"expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
"legendFormat": "{{instance}}"
}
],
"type": "graph"
},
{
"title": "Network Throughput (Gbps)",
"targets": [
{
"expr": "sum(rate(node_network_transmit_bytes_total[5m])) * 8 / 1e9",
"legendFormat": "Transmit"
},
{
"expr": "sum(rate(node_network_receive_bytes_total[5m])) * 8 / 1e9",
"legendFormat": "Receive"
}
],
"type": "graph"
}
]
},
{
"title": "Instance Health",
"panels": [
{
"title": "Redis Instances (Up/Down)",
"targets": [
{
"expr": "count(up{job=\"redis\"} == 1)",
"legendFormat": "Up"
},
{
"expr": "count(up{job=\"redis\"} == 0)",
"legendFormat": "Down"
}
],
"type": "singlestat"
},
{
"title": "Proxy Instances (Up/Down)",
"targets": [
{
"expr": "count(up{job=\"proxy\"} == 1)",
"legendFormat": "Up"
},
{
"expr": "count(up{job=\"proxy\"} == 0)",
"legendFormat": "Down"
}
],
"type": "singlestat"
}
]
}
]
}
}
Dashboard 2: Redis Performance
Key panels:
- Operations per second (instantaneous)
- Command latency (p50, p95, p99)
- Memory usage (used, max, fragmentation)
- Eviction rate (keys evicted/sec)
- Cache hit rate (hits / (hits + misses))
- Replication lag (seconds behind master)
- Cluster health (slots assigned, state)
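Illustrative PromQL for several of these panels, using only the redis_exporter metrics listed earlier (label sets may vary slightly by exporter version); ops/sec and cache hit rate queries are in Appendix A:
# Memory utilization (0-1)
redis_memory_used_bytes / redis_memory_max_bytes
# Eviction rate (keys/sec)
rate(redis_evicted_keys_total[5m])
# Replication lag (seconds, worst replica per instance)
max(redis_replication_lag_seconds) by (instance)
# Cluster health (instances reporting a failed cluster state)
count(redis_cluster_state == 0)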
Dashboard 3: Proxy Performance
Key panels:
- Requests per second
- Request latency (p50, p95, p99) by operation type
- Error rate (errors/sec, % of total)
- Cache hit rate (hot tier hit %)
- Backend latency breakdown (Redis, PostgreSQL, S3)
- Active connections
- Connection pool utilization
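Illustrative PromQL for these panels, built on the custom proxy metrics above; the `operation` label on the latency histogram is an assumption and should be adjusted to whatever labels the proxy actually attaches:
# Requests per second
sum(rate(prism_proxy_requests_total[5m]))
# Error rate (fraction of requests)
sum(rate(prism_proxy_requests_errors_total[5m])) / sum(rate(prism_proxy_requests_total[5m]))
# p99 latency by operation type (assumes an `operation` label on the histogram)
histogram_quantile(0.99, sum(rate(prism_proxy_requests_duration_seconds_bucket[5m])) by (le, operation))
# Backend latency breakdown (p95, repeat per backend histogram)
histogram_quantile(0.95, sum(rate(prism_proxy_redis_latency_seconds_bucket[5m])) by (le))
# Connection pool utilization
prism_proxy_active_connections / prism_proxy_connection_pool_size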
Dashboard 4: Network Topology
Key panels:
- Cross-AZ traffic (% of total traffic)
- Bandwidth utilization by AZ
- Packet loss rate
- Cross-AZ latency (average, p95, p99)
- Data transfer costs (estimated monthly)
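A sketch for the bandwidth panels; node_exporter counters carry no destination information, so the per-AZ breakdown relies on the `az` external label set in the local Prometheus configs, and true cross-AZ attribution would need flow-level data (e.g. VPC Flow Logs):
# Transmit bandwidth by AZ (Gbps)
sum(rate(node_network_transmit_bytes_total[5m])) by (az) * 8 / 1e9
# Receive bandwidth by AZ (Gbps)
sum(rate(node_network_receive_bytes_total[5m])) by (az) * 8 / 1e9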
Dashboard 5: Cost Tracking
Key panels:
- Instance hours by type (r6i.4xlarge, c6i.2xlarge)
- Data transfer (intra-AZ, cross-AZ, internet)
- Storage utilization (EBS, S3)
- Estimated monthly cost (running total)
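Cost panels are typically driven by counting running instances and multiplying by published hourly prices in Grafana; the `instance_type` label below is an assumption (it would come from a relabel rule or node labels), so treat this only as a sketch:
# Running instances grouped by an assumed instance_type label; multiply by hourly price per type in the panel
count(up{job="node"} == 1) by (instance_type)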
Dashboard Query Performance
Query Example (proxy latency p99):
histogram_quantile(0.99,
sum(rate(prism_proxy_requests_duration_seconds_bucket[5m])) by (le)
)
Query Execution Time (from MEMO-074 benchmarks):
- 30-day retention: <500ms
- 7-day retention: <200ms
- Real-time (last 5 minutes): <50ms
Assessment: ✅ Dashboard queries performant for operational use
Distributed Tracing (Jaeger)
Jaeger Architecture
Deployment Strategy: all-in-one image for development; production runs separate agent, collector, and query services backed by Cassandra
Jaeger Architecture (Production):
Client (Proxy Nodes)
↓ UDP 6831 (jaeger.thrift compact)
Jaeger Agent (DaemonSet on each node)
↓ gRPC
Jaeger Collector (replicas: 3)
↓ Write spans
Cassandra Cluster (3 nodes, RF=3)
↑ Read spans
Jaeger Query Service (replicas: 2)
↑ HTTP 16686
Grafana Explore
Benefits:
- ✅ Low-latency span submission (UDP to local agent)
- ✅ Buffering at collector (handles burst traffic)
- ✅ Scalable storage (Cassandra horizontal scaling)
- ✅ High availability (3 collectors, 2 query services)
Jaeger Deployment
Jaeger Agent (DaemonSet):
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: jaeger-agent
namespace: prism-observability
spec:
selector:
matchLabels:
app: jaeger-agent
template:
metadata:
labels:
app: jaeger-agent
spec:
hostNetwork: true
containers:
- name: jaeger-agent
image: jaegertracing/jaeger-agent:1.51.0
args:
- --reporter.grpc.host-port=jaeger-collector.prism-observability.svc.cluster.local:14250
- --reporter.grpc.retry.max=10
ports:
- name: compact
containerPort: 6831
protocol: UDP
- name: binary
containerPort: 6832
protocol: UDP
- name: admin
containerPort: 14271
protocol: TCP
resources:
requests:
cpu: "100m"
memory: "128Mi"
limits:
cpu: "500m"
memory: "512Mi"
Jaeger Collector:
apiVersion: apps/v1
kind: Deployment
metadata:
name: jaeger-collector
namespace: prism-observability
spec:
replicas: 3
selector:
matchLabels:
app: jaeger-collector
template:
metadata:
labels:
app: jaeger-collector
spec:
containers:
- name: jaeger-collector
image: jaegertracing/jaeger-collector:1.51.0
args:
- --cassandra.keyspace=jaeger_v1_dc1
- --cassandra.servers=cassandra.prism-observability.svc.cluster.local
- --cassandra.username=jaeger
- --cassandra.password=$(CASSANDRA_PASSWORD)
- --collector.zipkin.host-port=:9411
- --collector.num-workers=50
- --collector.queue-size=10000
env:
- name: CASSANDRA_PASSWORD
valueFrom:
secretKeyRef:
name: jaeger-cassandra-secret
key: password
ports:
- name: grpc
containerPort: 14250
- name: http
containerPort: 14268
- name: zipkin
containerPort: 9411
- name: admin
containerPort: 14269
resources:
requests:
cpu: "1"
memory: "2Gi"
limits:
cpu: "4"
memory: "8Gi"
Jaeger Query Service:
apiVersion: apps/v1
kind: Deployment
metadata:
name: jaeger-query
namespace: prism-observability
spec:
replicas: 2
selector:
matchLabels:
app: jaeger-query
template:
metadata:
labels:
app: jaeger-query
spec:
containers:
- name: jaeger-query
image: jaegertracing/jaeger-query:1.51.0
args:
- --cassandra.keyspace=jaeger_v1_dc1
- --cassandra.servers=cassandra.prism-observability.svc.cluster.local
- --cassandra.username=jaeger
- --cassandra.password=$(CASSANDRA_PASSWORD)
env:
- name: CASSANDRA_PASSWORD
valueFrom:
secretKeyRef:
name: jaeger-cassandra-secret
key: password
ports:
- name: query
containerPort: 16686
- name: admin
containerPort: 16687
resources:
requests:
cpu: "500m"
memory: "1Gi"
limits:
cpu: "2"
memory: "4Gi"
Cassandra Backend (for Jaeger)
StatefulSet (3 nodes, replication factor 3):
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: cassandra
namespace: prism-observability
spec:
serviceName: cassandra
replicas: 3
selector:
matchLabels:
app: cassandra
template:
metadata:
labels:
app: cassandra
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- cassandra
topologyKey: kubernetes.io/hostname
containers:
- name: cassandra
image: cassandra:4.1
env:
- name: CASSANDRA_CLUSTER_NAME
value: "prism-jaeger"
- name: CASSANDRA_DC
value: "DC1"
- name: CASSANDRA_RACK
value: "Rack1"
- name: CASSANDRA_SEEDS
value: "cassandra-0.cassandra.prism-observability.svc.cluster.local"
ports:
- name: cql
containerPort: 9042
- name: gossip
containerPort: 7000
volumeMounts:
- name: cassandra-data
mountPath: /var/lib/cassandra
resources:
requests:
cpu: "2"
memory: "8Gi"
limits:
cpu: "4"
memory: "16Gi"
volumeClaimTemplates:
- metadata:
name: cassandra-data
spec:
accessModes: ["ReadWriteOnce"]
storageClassName: gp3
resources:
requests:
storage: 500Gi
Cassandra Schema (simplified excerpt, initialized via CQL; in production Jaeger's cassandra-schema job creates the full schema):
CREATE KEYSPACE IF NOT EXISTS jaeger_v1_dc1
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
USE jaeger_v1_dc1;
CREATE TABLE IF NOT EXISTS traces (
trace_id blob,
span_id bigint,
span_hash bigint,
parent_id bigint,
operation_name text,
flags int,
start_time timestamp,
duration bigint,
tags list<frozen<tag>>,
logs list<frozen<log>>,
refs list<frozen<span_ref>>,
process frozen<process>,
PRIMARY KEY (trace_id, span_id, span_hash)
) WITH compaction = {'class': 'TimeWindowCompactionStrategy', 'compaction_window_size': 1, 'compaction_window_unit': 'HOURS'};
-- Indexes for efficient querying
CREATE INDEX IF NOT EXISTS traces_start_time_idx ON traces (start_time);
CREATE INDEX IF NOT EXISTS traces_operation_idx ON traces (operation_name);
Trace Sampling Strategy
Sampling Configuration:
# Sampling per MEMO-074 analysis: 1% sampling for 1.1B ops/sec = 11M spans/sec
apiVersion: v1
kind: ConfigMap
metadata:
name: jaeger-sampling-config
namespace: prism-observability
data:
sampling.json: |
{
"default_strategy": {
"type": "probabilistic",
"param": 0.01
},
"per_operation_strategies": {
"prism-proxy": [
{
"operation": "GetVertex",
"type": "probabilistic",
"param": 0.01
},
{
"operation": "GetEdges",
"type": "probabilistic",
"param": 0.01
},
{
"operation": "TraverseGraph",
"type": "probabilistic",
"param": 0.1
},
{
"operation": "HealthCheck",
"type": "probabilistic",
"param": 0.0001
}
]
}
}
Sampling Rationale:
- 1% default: 11M spans/sec (manageable by Cassandra)
- 10% for complex queries (traversals): Higher sampling for operations we care most about
- 0.01% for health checks: Reduce noise from high-frequency low-value operations
Span Volume:
Expected spans per day:
- GetVertex (70% of traffic): 1.1B ops/sec × 0.7 × 0.01 = 7.7M spans/sec
- GetEdges (20%): 1.1B × 0.2 × 0.01 = 2.2M spans/sec
- TraverseGraph (10%): 1.1B × 0.1 × 0.1 = 11M spans/sec
Total: 21M spans/sec
Daily: 21M spans/sec × 86,400 = 1.8 trillion spans/day
Average span size: 1 KB
Daily storage: 1.8T × 1 KB = 1.8 TB/day
7-day retention: 1.8 TB × 7 = 12.6 TB
Cassandra compression (5:1): 12.6 TB / 5 = 2.5 TB
Allocated storage (3 nodes × 500 GB × 3 RF): 4.5 TB
Utilization: 2.5 TB / 4.5 TB = 56%
Assessment: ✅ Storage capacity sufficient for 7-day trace retention
Trace Instrumentation (Proxy)
Rust OpenTelemetry Integration:
use opentelemetry::{global, trace::{Tracer, SpanKind}, KeyValue};
use opentelemetry_jaeger::new_pipeline;
use tracing_opentelemetry::OpenTelemetryLayer;
use tracing_subscriber::{layer::SubscriberExt, Registry};
// Initialize Jaeger tracer
let tracer = new_pipeline()
.with_service_name("prism-proxy")
.with_agent_endpoint("localhost:6831")
.with_trace_config(
opentelemetry::sdk::trace::config()
.with_sampler(opentelemetry::sdk::trace::Sampler::TraceIdRatioBased(0.01))
)
.install_batch(opentelemetry::runtime::Tokio)?;
// Set up tracing subscriber
let telemetry = OpenTelemetryLayer::new(tracer);
let subscriber = Registry::default().with(telemetry);
tracing::subscriber::set_global_default(subscriber)?;
// Instrument function with tracing.
// In async code, spans are attached to futures with `.instrument()` rather than holding an
// entered guard across `.await` points (which mis-attributes time and parenting).
use tracing::Instrument;
#[tracing::instrument(skip(self))]
async fn get_vertex(&self, vertex_id: &str) -> Result<Vertex, Error> {
    // 1. Query PostgreSQL for partition metadata
    let partition_id = self
        .postgres
        .get_partition(vertex_id)
        .instrument(tracing::info_span!("query_partition_metadata"))
        .await?;
    // 2. Check hot tier (Redis)
    if let Some(vertex) = self
        .redis
        .get(vertex_id)
        .instrument(tracing::info_span!("redis_get", partition = %partition_id))
        .await?
    {
        return Ok(vertex);
    }
    // 3. Load from cold tier (S3)
    let partition = self
        .s3
        .load_partition(partition_id)
        .instrument(tracing::info_span!("s3_load_partition", partition = %partition_id))
        .await?;
    // 4. Promote to hot tier
    self.redis
        .set(vertex_id, &partition.get_vertex(vertex_id))
        .instrument(tracing::info_span!("promote_to_hot_tier"))
        .await?;
    Ok(partition.get_vertex(vertex_id))
}
Trace Example (GetVertex span):
{
"traceID": "5a2d3f8b4c1e6a7b",
"spanID": "1234567890abcdef",
"operationName": "get_vertex",
"startTime": 1700000000000000,
"duration": 45000,
"tags": [
{"key": "vertex.id", "value": "user:123456"},
{"key": "partition.id", "value": "42"},
{"key": "cache.hit", "value": "false"},
{"key": "tier", "value": "cold"}
],
"logs": [],
"references": [],
"process": {
"serviceName": "prism-proxy",
"tags": [
{"key": "hostname", "value": "proxy-node-123"},
{"key": "ip", "value": "10.0.10.50"}
]
},
"children": [
{
"spanID": "2345678901bcdef0",
"operationName": "query_partition_metadata",
"duration": 2000,
"tags": [{"key": "db.type", "value": "postgresql"}]
},
{
"spanID": "3456789012cdef01",
"operationName": "redis_get",
"duration": 800,
"tags": [
{"key": "db.type", "value": "redis"},
{"key": "cache.hit", "value": "false"}
]
},
{
"spanID": "456789013def0123",
"operationName": "s3_load_partition",
"duration": 35000,
"tags": [
{"key": "storage.type", "value": "s3"},
{"key": "partition.size", "value": "100MB"}
]
},
{
"spanID": "56789014ef012345",
"operationName": "promote_to_hot_tier",
"duration": 1200,
"tags": [{"key": "db.type", "value": "redis"}]
}
]
}
Trace Query (in Jaeger UI):
- Service: prism-proxy
- Operation: get_vertex
- Tags: cache.hit=false, tier=cold
- Duration: >40ms
- Result: shows all cold tier accesses taking >40ms
Logging (Loki)
Loki Architecture
Deployment Strategy: Microservices mode with S3 backend
Loki Architecture:
Proxy Nodes / Redis / PostgreSQL
↓ HTTP (push logs)
Loki Distributor (replicas: 3)
↓ Write to S3
S3 Bucket: prism-logs
↑ Read from S3
Loki Querier (replicas: 2)
↑ Query logs
Grafana Explore
Loki Deployment
Loki Configuration:
auth_enabled: false
server:
http_listen_port: 3100
grpc_listen_port: 9096
common:
path_prefix: /loki
storage:
s3:
s3: s3://us-west-2/prism-logs
s3forcepathstyle: true
replication_factor: 1
schema_config:
configs:
- from: 2024-01-01
store: tsdb
object_store: s3
schema: v12
index:
prefix: index_
period: 24h
storage_config:
tsdb_shipper:
active_index_directory: /loki/tsdb-index
cache_location: /loki/tsdb-cache
shared_store: s3
aws:
s3: s3://us-west-2/prism-logs
region: us-west-2
compactor:
working_directory: /loki/compactor
shared_store: s3
compaction_interval: 10m
limits_config:
retention_period: 168h # 7 days
ingestion_rate_mb: 10
ingestion_burst_size_mb: 20
max_query_length: 720h
max_query_lookback: 720h
chunk_store_config:
max_look_back_period: 168h
table_manager:
retention_deletes_enabled: true
retention_period: 168h
Distributor Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
name: loki-distributor
namespace: prism-observability
spec:
replicas: 3
selector:
matchLabels:
app: loki
component: distributor
template:
metadata:
labels:
app: loki
component: distributor
spec:
containers:
- name: loki
image: grafana/loki:2.9.3
args:
- -config.file=/etc/loki/config.yaml
- -target=distributor
ports:
- name: http
containerPort: 3100
- name: grpc
containerPort: 9096
volumeMounts:
- name: config
mountPath: /etc/loki
resources:
requests:
cpu: "500m"
memory: "1Gi"
limits:
cpu: "2"
memory: "4Gi"
volumes:
- name: config
configMap:
name: loki-config
Querier Deployment (similar structure, -target=querier).
Log Collection (Promtail)
Promtail DaemonSet (collects logs from all nodes):
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: promtail
namespace: prism-observability
spec:
selector:
matchLabels:
app: promtail
template:
metadata:
labels:
app: promtail
spec:
serviceAccountName: promtail
containers:
- name: promtail
image: grafana/promtail:2.9.3
args:
- -config.file=/etc/promtail/config.yaml
volumeMounts:
- name: config
mountPath: /etc/promtail
- name: varlog
mountPath: /var/log
readOnly: true
- name: varlibdockercontainers
mountPath: /var/lib/docker/containers
readOnly: true
env:
- name: HOSTNAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
resources:
requests:
cpu: "100m"
memory: "128Mi"
limits:
cpu: "500m"
memory: "512Mi"
volumes:
- name: config
configMap:
name: promtail-config
- name: varlog
hostPath:
path: /var/log
- name: varlibdockercontainers
hostPath:
path: /var/lib/docker/containers
Promtail Configuration:
server:
http_listen_port: 9080
grpc_listen_port: 0
positions:
filename: /tmp/positions.yaml
clients:
- url: http://loki-distributor.prism-observability.svc.cluster.local:3100/loki/api/v1/push
scrape_configs:
# Kubernetes pod logs
- job_name: kubernetes-pods
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_node_name]
target_label: node
- source_labels: [__meta_kubernetes_namespace]
target_label: namespace
- source_labels: [__meta_kubernetes_pod_name]
target_label: pod
- source_labels: [__meta_kubernetes_pod_container_name]
target_label: container
pipeline_stages:
- docker: {}
- json:
expressions:
level: level
msg: msg
trace_id: trace_id
latency_ms: latency_ms
- labels:
level:
trace_id:
# System logs
- job_name: system
static_configs:
- targets:
- localhost
labels:
job: syslog
__path__: /var/log/syslog
Structured Logging (Proxy)
Rust Logging Setup:
use tracing::{info, error, warn};
use tracing_subscriber::fmt::format::FmtSpan;
// Initialize structured logging (JSON format)
tracing_subscriber::fmt()
.json()
.with_max_level(tracing::Level::INFO)
.with_span_events(FmtSpan::CLOSE)
.init();
// Example log statement
#[tracing::instrument(skip(self))]
async fn handle_request(&self, req: Request) -> Result<Response, Error> {
info!(
request.id = %req.id,
request.method = %req.method,
request.path = %req.path,
"Received request"
);
let start = Instant::now();
let result = self.process_request(req).await;
let latency_ms = start.elapsed().as_millis();
match result {
Ok(resp) => {
info!(
request.id = %req.id,
response.status = %resp.status,
latency_ms = %latency_ms,
"Request completed successfully"
);
Ok(resp)
}
Err(e) => {
error!(
request.id = %req.id,
error = %e,
latency_ms = %latency_ms,
"Request failed"
);
Err(e)
}
}
}
Log Output (JSON):
{
"timestamp": "2025-11-16T12:00:00.123456Z",
"level": "INFO",
"target": "prism_proxy",
"fields": {
"message": "Request completed successfully",
"request.id": "req-abc123",
"response.status": 200,
"latency_ms": 3,
"trace_id": "5a2d3f8b4c1e6a7b",
"span.name": "handle_request",
"span.id": "1234567890abcdef"
}
}
Log Retention and Costs
Log Volume:
Log generation:
- Proxy nodes: 1000 × 100 logs/sec = 100,000 logs/sec
- Redis: 1000 × 10 logs/sec = 10,000 logs/sec
- PostgreSQL: 4 × 50 logs/sec = 200 logs/sec
Total: ~110,000 logs/sec
Average log size: 500 bytes (JSON)
Daily volume: 110,000 × 500 bytes × 86,400 = 4.75 GB/day
Weekly volume (7 days): 4.75 GB × 7 = 33.25 GB
Compression (5:1): 33.25 GB / 5 = 6.65 GB
S3 storage cost: 6.65 GB × $0.023/GB = $0.15/month
Assessment: ✅ Negligible storage cost for logs
Alerting (Alertmanager)
Alertmanager Deployment
apiVersion: apps/v1
kind: StatefulSet # stable pod names (alertmanager-0, alertmanager-1) are required for the --cluster.peer addresses below
metadata:
name: alertmanager
namespace: prism-observability
spec:
serviceName: alertmanager
replicas: 2
selector:
matchLabels:
app: alertmanager
template:
metadata:
labels:
app: alertmanager
spec:
containers:
- name: alertmanager
image: prom/alertmanager:v0.26.0
args:
- --config.file=/etc/alertmanager/config.yml
- --storage.path=/alertmanager
- --cluster.listen-address=0.0.0.0:9094
- --cluster.peer=alertmanager-0.alertmanager.prism-observability.svc.cluster.local:9094
- --cluster.peer=alertmanager-1.alertmanager.prism-observability.svc.cluster.local:9094
ports:
- name: web
containerPort: 9093
- name: cluster
containerPort: 9094
volumeMounts:
- name: config
mountPath: /etc/alertmanager
- name: storage
mountPath: /alertmanager
resources:
requests:
cpu: "100m"
memory: "128Mi"
limits:
cpu: "500m"
memory: "512Mi"
volumes:
- name: config
configMap:
name: alertmanager-config
- name: storage
emptyDir: {}
Alertmanager Configuration
global:
resolve_timeout: 5m
slack_api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXX'
pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
route:
receiver: 'slack-default'
group_by: ['alertname', 'cluster', 'service']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
routes:
# Critical alerts → PagerDuty
- match:
severity: critical
receiver: 'pagerduty-critical'
continue: true
# Warning alerts → Slack
- match:
severity: warning
receiver: 'slack-warnings'
# Info alerts → Email
- match:
severity: info
receiver: 'email-info'
receivers:
- name: 'slack-default'
slack_configs:
- channel: '#prism-alerts'
title: '{{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
send_resolved: true
- name: 'slack-warnings'
slack_configs:
- channel: '#prism-warnings'
title: '{{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
send_resolved: true
- name: 'pagerduty-critical'
pagerduty_configs:
- routing_key: 'YOUR_PAGERDUTY_ROUTING_KEY' # Events API v2 integration key
description: '{{ .GroupLabels.alertname }}: {{ .GroupLabels.instance }}'
severity: 'critical'
details:
firing: '{{ .Alerts.Firing | len }}'
resolved: '{{ .Alerts.Resolved | len }}'
- name: 'email-info'
email_configs:
- to: 'prism-alerts@example.com'
from: 'alertmanager@example.com'
smarthost: 'smtp.example.com:587'
auth_username: 'alertmanager'
auth_password: 'password'
headers:
Subject: 'Prism Alert: {{ .GroupLabels.alertname }}'
Alert Rules
Redis Alerts (prometheus-rules/redis.yml):
groups:
- name: redis-alerts
interval: 30s
rules:
- alert: RedisDown
expr: up{job="redis"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Redis instance down"
description: "Redis instance {{ $labels.instance }} is down for more than 1 minute"
runbook: "https://runbooks.example.com/redis-down"
- alert: RedisMemoryHigh
expr: (redis_memory_used_bytes / redis_memory_max_bytes) > 0.9
for: 5m
labels:
severity: warning
annotations:
summary: "Redis memory usage high"
description: "Redis instance {{ $labels.instance }} memory usage is {{ $value | humanizePercentage }}"
runbook: "https://runbooks.example.com/redis-memory"
- alert: RedisEvictionRate
expr: rate(redis_evicted_keys_total[5m]) > 100
for: 5m
labels:
severity: warning
annotations:
summary: "Redis eviction rate high"
description: "Redis instance {{ $labels.instance }} evicting {{ $value }} keys/sec"
runbook: "https://runbooks.example.com/redis-evictions"
- alert: RedisReplicationLag
expr: redis_replication_lag_seconds > 10
for: 2m
labels:
severity: critical
annotations:
summary: "Redis replication lag high"
description: "Redis replica {{ $labels.instance }} lagging {{ $value }}s behind master"
runbook: "https://runbooks.example.com/redis-replication-lag"
- alert: RedisClusterDown
expr: redis_cluster_state == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Redis cluster unhealthy"
description: "Redis cluster {{ $labels.cluster }} is in failed state"
runbook: "https://runbooks.example.com/redis-cluster-down"
Proxy Alerts (prometheus-rules/proxy.yml):
groups:
- name: proxy-alerts
interval: 30s
rules:
- alert: ProxyHighErrorRate
expr: (rate(prism_proxy_requests_errors_total[5m]) / rate(prism_proxy_requests_total[5m])) > 0.01
for: 2m
labels:
severity: critical
annotations:
summary: "Proxy error rate high"
description: "Proxy {{ $labels.instance }} error rate is {{ $value | humanizePercentage }}"
runbook: "https://runbooks.example.com/proxy-errors"
- alert: ProxyHighLatency
expr: histogram_quantile(0.99, rate(prism_proxy_requests_duration_seconds_bucket[5m])) > 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "Proxy p99 latency high"
description: "Proxy {{ $labels.instance }} p99 latency is {{ $value }}s"
runbook: "https://runbooks.example.com/proxy-latency"
- alert: ProxyCacheMissRate
expr: prism_proxy_cache_hit_rate < 0.8
for: 10m
labels:
severity: warning
annotations:
summary: "Proxy cache hit rate low"
description: "Proxy {{ $labels.instance }} cache hit rate is {{ $value | humanizePercentage }}"
runbook: "https://runbooks.example.com/cache-miss"
- alert: ProxyConnectionPoolExhausted
expr: prism_proxy_active_connections / prism_proxy_connection_pool_size > 0.9
for: 5m
labels:
severity: critical
annotations:
summary: "Proxy connection pool nearly exhausted"
description: "Proxy {{ $labels.instance }} using {{ $value | humanizePercentage }} of connection pool"
runbook: "https://runbooks.example.com/connection-pool"
Infrastructure Alerts (prometheus-rules/infrastructure.yml):
groups:
- name: infrastructure-alerts
interval: 30s
rules:
- alert: HighCPUUsage
expr: (100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
for: 10m
labels:
severity: warning
annotations:
summary: "High CPU usage"
description: "Instance {{ $labels.instance }} CPU usage is {{ $value }}%"
runbook: "https://runbooks.example.com/high-cpu"
- alert: HighMemoryUsage
expr: ((1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100) > 90
for: 5m
labels:
severity: critical
annotations:
summary: "High memory usage"
description: "Instance {{ $labels.instance }} memory usage is {{ $value }}%"
runbook: "https://runbooks.example.com/high-memory"
- alert: DiskSpaceLow
expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "Disk space low"
description: "Instance {{ $labels.instance }} disk {{ $labels.mountpoint }} has {{ $value | humanizePercentage }} free"
runbook: "https://runbooks.example.com/disk-space"
- alert: NetworkThroughputHigh
expr: (rate(node_network_transmit_bytes_total[5m]) * 8 / 1e9) > 8
for: 10m
labels:
severity: warning
annotations:
summary: "Network throughput approaching limit"
description: "Instance {{ $labels.instance }} network throughput is {{ $value }} Gbps (80% of 10 Gbps)"
runbook: "https://runbooks.example.com/network-saturation"
Observability Costs
Monthly Cost Summary
| Component | Cost/month | Notes |
|---|---|---|
| Prometheus (self-hosted) | $1,854 | 5 instances (3 local + 2 global) × c6i.xlarge ($0.17/hour reserved) × 730 hours + 2.5 TB EBS ($0.08/GB) |
| Grafana (self-hosted) | $248 | 2 instances × c6i.large ($0.085/hour reserved) × 730 hours + 100 GB EBS |
| Jaeger | $1,854 | 3 collectors + 2 query services × c6i.large; agents run as a DaemonSet on existing nodes (no additional instances) |
| Cassandra (Jaeger backend) | $1,112 | 3 nodes × c6i.2xlarge ($0.17/hour reserved) × 730 hours + 1.5 TB EBS |
| Loki | $372 | 3 distributors + 2 queriers × t3.medium ($0.0416/hour on-demand) × 730 hours |
| S3 storage | $407 | Thanos (200 GB @ $0.023/GB) + Loki logs (33 GB) + Jaeger overflow |
| Total | $5,847 | vs $35,502 CloudWatch-only (MEMO-077), 84% reduction |
Cost Breakdown Rationale:
- ✅ Self-hosted Prometheus: $30K/month savings vs CloudWatch Metrics (100K custom metrics)
- ✅ S3 for long-term storage: 96% cheaper than CloudWatch retention
- ✅ Cassandra for traces: Cheaper than managed tracing (AWS X-Ray charges $5 per million traces recorded, which is orders of magnitude more expensive at this sampled span volume)
- ⚠️ Operational overhead: Requires SRE team to manage observability stack
Total Infrastructure + Observability Costs:
- Infrastructure (MEMO-077): $938,757/month
- Observability: $5,847/month
- Total: $944,604/month ($11.3M/year, $34.0M over 3 years)
- vs MEMO-076 baseline ($32.4M): 5% higher, acceptable for production observability
Recommendations
Primary Recommendation
Deploy self-hosted observability stack with:
- ✅ Prometheus HA cluster (3 local + 2 global instances, 30-day retention)
- ✅ Grafana (2 instances for HA, PostgreSQL backend for dashboard persistence)
- ✅ Jaeger with Cassandra (3 collectors, 2 query services, 7-day trace retention, 1% sampling)
- ✅ Loki with S3 backend (7-day log retention, structured JSON logs)
- ✅ Alertmanager (2 instances for HA, PagerDuty + Slack + email receivers)
- ✅ Thanos (long-term metrics storage in S3, 1-year retention)
Monthly Cost: $5,847 (84% cheaper than CloudWatch-only approach)
Operational Trade-off: Requires SRE team to manage observability infrastructure, but provides:
- Full control over sampling, retention, costs
- No vendor lock-in
- Integration with existing tools (Grafana, Jaeger)
- Significantly lower costs at scale
Dashboard Priorities
Week 1 (production launch):
- Infrastructure overview (compute, memory, network)
- Redis performance (ops/sec, latency, memory)
- Proxy performance (requests/sec, latency, errors)
Week 2 (operational maturity):
- Network topology (cross-AZ traffic, bandwidth)
- Cost tracking (instance hours, data transfer)
Week 3 (deep observability):
- Distributed tracing integration (Jaeger in Grafana Explore)
- Log correlation (Loki logs linked to traces)
Alert Tuning Strategy
Phase 1: Conservative (first 30 days):
- High thresholds to avoid alert fatigue
- All critical alerts routed to on-call
- Daily alert review meetings
Phase 2: Calibration (30-90 days):
- Adjust thresholds based on observed baselines
- Add warning alerts for leading indicators (see the example expressions after this list)
- Tune group_wait and group_interval
Phase 3: Mature (90+ days):
- Fine-grained alerts with context
- Auto-remediation for common issues
- Runbooks tested and updated
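For the Phase 2 leading-indicator alerts, trend-based expressions are often more useful than static thresholds. Two hedged sketches built on metrics already collected above (the lookback windows and thresholds are placeholders to be calibrated against observed baselines):
# Filesystem projected to fill within 4 hours, based on the last 6 hours of growth
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 4 * 3600) < 0
# Redis memory projected to exceed maxmemory within 4 hours
predict_linear(redis_memory_used_bytes[1h], 4 * 3600) > redis_memory_max_bytes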
Next Steps
Week 19: Development Tooling and CI/CD Pipelines
Focus: Build and deployment automation for continuous delivery
Tasks:
- CI/CD pipeline design (GitHub Actions, GitLab CI, or Jenkins)
- Docker image builds (Rust proxy, Redis with custom config)
- Terraform pipeline (plan, apply, destroy workflows)
- Kubernetes manifests management (Helm charts, Kustomize)
- Automated testing integration (unit, integration, load tests)
Success Criteria:
- Automated deployments from Git commits
- Infrastructure changes reviewed and approved before apply
- Rollback capability within 5 minutes
- Blue/green deployment strategy for proxy updates
Appendices
Appendix A: Prometheus Query Examples
Redis Operations per Second:
sum(rate(redis_commands_processed_total[5m])) by (instance)
Proxy p99 Latency:
histogram_quantile(0.99,
sum(rate(prism_proxy_requests_duration_seconds_bucket[5m])) by (le)
)
Cache Hit Rate:
sum(rate(redis_keyspace_hits_total[5m])) /
(sum(rate(redis_keyspace_hits_total[5m])) + sum(rate(redis_keyspace_misses_total[5m])))
Cross-AZ Traffic Percentage:
(sum(rate(node_network_transmit_bytes_total{az!="us-west-2a"}[5m])) /
sum(rate(node_network_transmit_bytes_total[5m]))) * 100
Appendix B: Runbook Template
Title: Redis Instance Down
Severity: Critical
Symptoms:
- Alert: RedisDown firing
- Prometheus target redis:9121 unreachable
- Graph queries returning errors
Investigation:
- Check instance health: aws ec2 describe-instance-status --instance-id i-xxxxx
- SSH to instance (or use Systems Manager): aws ssm start-session --target i-xxxxx
- Check Redis process: systemctl status redis
- Check Redis logs: journalctl -u redis -n 100
Resolution:
- If process crashed: systemctl restart redis
- If instance failed: terminate the instance; the Auto Scaling Group will replace it
- If cluster split-brain: Follow Redis Cluster recovery procedure (link to detailed runbook)
Prevention:
- Monitor Redis memory usage (alert before OOM)
- Enable Redis persistence (RDB + AOF)
- Ensure Auto Scaling Group health checks configured
Appendix C: Observability Validation Checklist
Metrics:
- All 2000 instances scraped by Prometheus
- No scrape errors in last 24 hours
- Prometheus storage utilization <80%
- Grafana dashboards loading <500ms
- Alert rules validated (test firing)
Tracing:
- Jaeger receiving spans from all proxy nodes
- Trace sampling rate = 1% (measured)
- Cassandra storage utilization <60%
- End-to-end traces visible in Grafana Explore
- Trace-to-log correlation working
Logging:
- Loki receiving logs from all nodes
- Logs searchable in Grafana Explore
- Log volume <10 GB/day
- S3 storage costs <$1/month
- No PII in logs (verified with sample queries)
Alerting:
- PagerDuty integration tested (test alert sent)
- Slack notifications working
- Alert grouping configured correctly
- Runbooks linked from all alerts
- On-call rotation configured