
MEMO-078: Week 18 - Observability Stack Setup

Date: 2025-11-16 Updated: 2025-11-16 Author: Platform Team Related: MEMO-074, MEMO-075, MEMO-077, RFC-060

Executive Summary

Goal: Deploy production-ready observability stack for 100B vertex graph system

Scope: Metrics (Prometheus), visualization (Grafana), distributed tracing (Jaeger), logging (Loki), alerting (Alertmanager)

Findings:

  • Metrics collection: 500K metrics/sec from 2000 instances (Prometheus HA cluster)
  • Trace sampling: 1% sampling = 11M spans/sec (Jaeger with Cassandra backend)
  • Log aggregation: 10 GB/day structured logs (Loki with S3 storage)
  • Dashboard latency: <500ms query time for 30-day retention
  • Alert delivery: <30 seconds from threshold breach to PagerDuty
  • Storage costs: $5,847/month (reduced from $35,502 via self-hosted Prometheus)

Validation: Observability covers all components validated in MEMO-074 benchmarks

Recommendation: Deploy self-hosted Prometheus + Grafana + Jaeger stack with 30-day retention and tiered alerting


Methodology

Observability Requirements

1. Metrics (Time-Series Data):

  • Collect system metrics (CPU, memory, network, disk) from 2000 instances
  • Collect application metrics (requests/sec, latency, errors) from proxy nodes
  • Collect Redis metrics (ops/sec, memory usage, evictions, replication lag)
  • Collect PostgreSQL metrics (queries/sec, connection pool, replication lag)
  • Retention: 30 days high-resolution, 1 year downsampled

2. Distributed Tracing:

  • Trace query execution from client → proxy → Redis/S3 → response
  • Capture latency breakdown by operation (metadata lookup, hot tier access, cold tier load)
  • Sample 1% of requests (11M spans/sec from 1.1B ops/sec)
  • Retention: 7 days full traces, 30 days sampled

3. Logging:

  • Structured JSON logs from proxy nodes, Redis, PostgreSQL
  • Centralized aggregation and search
  • Retention: 7 days full logs, 30 days errors only
  • Privacy: No PII in logs (use trace IDs for correlation)

4. Alerting:

  • Multi-tier: Critical (PagerDuty), Warning (Slack), Info (email)
  • Auto-remediation: Scale-out on high load, restart unhealthy instances (see the scaling sketch after this list)
  • Runbook links: Every alert includes link to remediation guide
  • On-call rotation: 24/7 coverage with escalation policies
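
As a concrete illustration of the scale-out path referenced above, a minimal sketch assuming the proxy runs as a Kubernetes Deployment named prism-proxy; the replica counts and CPU threshold are illustrative, not decided in this memo:

# Hypothetical auto-remediation: Horizontal Pod Autoscaler for the proxy tier
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: prism-proxy
  namespace: prism
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: prism-proxy
  minReplicas: 1000          # illustrative; matches the nominal proxy fleet size
  maxReplicas: 1500
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU exceeds 70%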

5. Dashboards:

  • Infrastructure overview (compute, network, storage utilization)
  • Redis performance (ops/sec, latency percentiles, memory, evictions)
  • Proxy performance (requests/sec, latency, errors, cache hit rate)
  • Network topology (cross-AZ traffic, bandwidth utilization)
  • Cost tracking (instance hours, data transfer, storage)

Metrics Collection (Prometheus)

Prometheus Architecture

Deployment Strategy: High-availability cluster with federation

Prometheus Architecture (3-tier):

Tier 1: Local Prometheus (per AZ)
├── AZ us-west-2a: Prometheus instance (scrapes 667 instances)
├── AZ us-west-2b: Prometheus instance (scrapes 667 instances)
└── AZ us-west-2c: Prometheus instance (scrapes 666 instances)

│ Federation (aggregate metrics)

Tier 2: Global Prometheus (HA pair)
├── Primary: Aggregates from 3 AZ instances
└── Secondary: Hot standby for failover

│ Long-term storage

Tier 3: Thanos (object storage for 1-year retention)
└── S3 bucket: prism-metrics (compressed time-series)

Benefits:

  • ✅ Decentralized scraping (local to AZ, low latency)
  • ✅ High availability (3 local + 2 global instances)
  • ✅ Horizontal scaling (add more local instances per AZ)
  • ✅ Cost-effective long-term storage (S3 via Thanos)
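
The Tier 3 wiring is only outlined above. A minimal sketch of the object-store configuration that Thanos Receive and Store Gateway would share, assuming the prism-metrics bucket from the diagram (file path and endpoint are assumptions):

# /etc/thanos/objstore.yml (mounted into thanos-receive and thanos-store)
type: S3
config:
  bucket: prism-metrics
  region: us-west-2
  endpoint: s3.us-west-2.amazonaws.com

# Example args for the receive component (local TSDB retention kept short,
# long-term blocks uploaded to S3):
#   thanos receive \
#     --objstore.config-file=/etc/thanos/objstore.yml \
#     --remote-write.address=0.0.0.0:19291 \
#     --tsdb.retention=2h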

Prometheus Configuration

Local Prometheus (per AZ):

# prometheus-local-us-west-2a.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
external_labels:
cluster: prism
region: us-west-2
az: us-west-2a

scrape_configs:
# Redis exporters (334 instances in this AZ)
- job_name: 'redis'
static_configs:
- targets:
- 10.0.10.10:9121 # redis-exporter on each Redis instance
- 10.0.10.11:9121
# ... (334 targets total)
relabel_configs:
- source_labels: [__address__]
target_label: instance
- source_labels: [__address__]
regex: '10\.0\.10\.(.*):9121'
replacement: 'redis-${1}'
target_label: redis_instance

# Proxy node exporters (334 instances in this AZ)
- job_name: 'proxy'
kubernetes_sd_configs:
- role: pod
namespaces:
names: [prism]
relabel_configs:
- source_labels: [__meta_kubernetes_pod_label_app]
action: keep
regex: prism-proxy
- source_labels: [__meta_kubernetes_pod_ip]
target_label: instance
- source_labels: [__address__]
target_label: __address__
replacement: '${1}:9090' # Metrics port

# Node exporters (system metrics from all instances)
- job_name: 'node'
static_configs:
- targets:
- 10.0.10.10:9100
- 10.0.10.11:9100
# ... (668 targets: 334 Redis + 334 Proxy)

# PostgreSQL exporter (1 primary + 2 replicas in this region)
- job_name: 'postgres'
static_configs:
- targets:
- postgres-primary.prism.svc.cluster.local:9187
- postgres-replica-1.prism.svc.cluster.local:9187

# Kubernetes metrics (EKS cluster)
- job_name: 'kubernetes-apiservers'
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https

# cAdvisor (container metrics)
- job_name: 'cadvisor'
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor

# Storage configuration
storage:
tsdb:
path: /prometheus/data
retention.time: 30d
retention.size: 500GB

# Remote write to global Prometheus (federation)
remote_write:
- url: http://prometheus-global-primary.prism.svc.cluster.local:9090/api/v1/write
queue_config:
capacity: 100000
max_shards: 10
min_shards: 1
max_samples_per_send: 10000
batch_send_deadline: 5s

Global Prometheus (HA pair):

# prometheus-global.yml
global:
scrape_interval: 60s # Lower frequency for aggregated metrics
evaluation_interval: 60s
external_labels:
cluster: prism
region: us-west-2
prometheus: global

# Scrape local Prometheus instances (federation)
scrape_configs:
- job_name: 'federate-local'
honor_labels: true
metrics_path: /federate
params:
'match[]':
- '{job=~"redis|proxy|postgres|node"}' # Federate all jobs
static_configs:
- targets:
- prometheus-local-us-west-2a.prism.svc.cluster.local:9090
- prometheus-local-us-west-2b.prism.svc.cluster.local:9090
- prometheus-local-us-west-2c.prism.svc.cluster.local:9090

# Alerting rules
rule_files:
- /etc/prometheus/rules/redis.yml
- /etc/prometheus/rules/proxy.yml
- /etc/prometheus/rules/infrastructure.yml
- /etc/prometheus/rules/network.yml

# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager.prism-observability.svc.cluster.local:9093

# Remote write to Thanos (long-term storage)
remote_write:
- url: http://thanos-receive.prism.svc.cluster.local:19291/api/v1/receive

Prometheus Deployment (Kubernetes)

StatefulSet (for persistent storage):

apiVersion: apps/v1
kind: StatefulSet
metadata:
name: prometheus-local-us-west-2a
namespace: prism-observability
spec:
serviceName: prometheus-local-us-west-2a
replicas: 1
selector:
matchLabels:
app: prometheus
tier: local
az: us-west-2a

template:
metadata:
labels:
app: prometheus
tier: local
az: us-west-2a
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: topology.kubernetes.io/zone
operator: In
values:
- us-west-2a

containers:
- name: prometheus
image: prom/prometheus:v2.48.0
args:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--storage.tsdb.retention.time=30d'
- '--storage.tsdb.retention.size=500GB'
- '--web.enable-lifecycle'
- '--web.enable-admin-api'

ports:
- name: web
containerPort: 9090

volumeMounts:
- name: config
mountPath: /etc/prometheus
- name: storage
mountPath: /prometheus

resources:
requests:
cpu: "4"
memory: "16Gi"
limits:
cpu: "8"
memory: "32Gi"

livenessProbe:
httpGet:
path: /-/healthy
port: 9090
initialDelaySeconds: 30
periodSeconds: 10

readinessProbe:
httpGet:
path: /-/ready
port: 9090
initialDelaySeconds: 10
periodSeconds: 5

volumes:
- name: config
configMap:
name: prometheus-config-us-west-2a

volumeClaimTemplates:
- metadata:
name: storage
spec:
accessModes: ["ReadWriteOnce"]
storageClassName: gp3
resources:
requests:
storage: 500Gi

Metrics Exporters

Redis Exporter (deployed as sidecar or separate DaemonSet):

apiVersion: apps/v1
kind: DaemonSet
metadata:
name: redis-exporter
namespace: prism
spec:
selector:
matchLabels:
app: redis-exporter
template:
metadata:
labels:
app: redis-exporter
spec:
hostNetwork: true # Access Redis on host
containers:
- name: redis-exporter
image: oliver006/redis_exporter:v1.55.0
env:
- name: REDIS_ADDR
value: "localhost:6379"
- name: REDIS_EXPORTER_INCL_SYSTEM_METRICS
value: "true"
ports:
- name: metrics
containerPort: 9121
resources:
requests:
cpu: "100m"
memory: "128Mi"
limits:
cpu: "500m"
memory: "512Mi"

Node Exporter (system metrics):

apiVersion: apps/v1
kind: DaemonSet
metadata:
name: node-exporter
namespace: prism-observability
spec:
selector:
matchLabels:
app: node-exporter
template:
metadata:
labels:
app: node-exporter
spec:
hostNetwork: true
hostPID: true
containers:
- name: node-exporter
image: prom/node-exporter:v1.7.0
args:
- '--path.procfs=/host/proc'
- '--path.sysfs=/host/sys'
- '--path.rootfs=/host/root'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
ports:
- name: metrics
containerPort: 9100
volumeMounts:
- name: proc
mountPath: /host/proc
readOnly: true
- name: sys
mountPath: /host/sys
readOnly: true
- name: root
mountPath: /host/root
mountPropagation: HostToContainer
readOnly: true
resources:
requests:
cpu: "100m"
memory: "128Mi"
limits:
cpu: "500m"
memory: "512Mi"
volumes:
- name: proc
hostPath:
path: /proc
- name: sys
hostPath:
path: /sys
- name: root
hostPath:
path: /

PostgreSQL Exporter:

apiVersion: v1
kind: Service
metadata:
name: postgres-exporter
namespace: prism
spec:
selector:
app: postgres-exporter
ports:
- port: 9187
targetPort: 9187
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: postgres-exporter
namespace: prism
spec:
replicas: 1
selector:
matchLabels:
app: postgres-exporter
template:
metadata:
labels:
app: postgres-exporter
spec:
containers:
- name: postgres-exporter
image: prometheuscommunity/postgres-exporter:v0.15.0
env:
- name: DATA_SOURCE_NAME
valueFrom:
secretKeyRef:
name: postgres-exporter-secret
key: connection-string # postgresql://user:pass@postgres:5432/prism?sslmode=require
ports:
- name: metrics
containerPort: 9187
resources:
requests:
cpu: "100m"
memory: "128Mi"
limits:
cpu: "500m"
memory: "512Mi"

Key Metrics Collected

Redis Metrics (from redis_exporter):

# Operations
redis_commands_processed_total # Total commands processed
redis_commands_duration_seconds_total # Command execution time

# Memory
redis_memory_used_bytes # Current memory usage
redis_memory_max_bytes # Max memory limit
redis_mem_fragmentation_ratio # Memory fragmentation

# Replication
redis_connected_slaves # Number of replicas
redis_replication_lag_seconds # Replica lag

# Cluster
redis_cluster_state # 1=ok, 0=fail
redis_cluster_slots_assigned # Assigned hash slots

# Performance
redis_instantaneous_ops_per_sec # Current ops/sec
redis_keyspace_hits_total # Cache hits
redis_keyspace_misses_total # Cache misses
redis_evicted_keys_total # Evicted keys (memory pressure)

Proxy Metrics (custom metrics from Rust application):

// Rust code to expose metrics
use prometheus::{Counter, Encoder, Gauge, Histogram, HistogramOpts, Registry, TextEncoder};
use warp::Filter;

// Request counters
let requests_total = Counter::new("prism_proxy_requests_total", "Total requests")?;
let requests_errors = Counter::new("prism_proxy_requests_errors_total", "Total errors")?;

// Latency histogram (buckets from 1 ms to 5 s)
let latency_histogram = Histogram::with_opts(
    HistogramOpts::new("prism_proxy_latency_seconds", "Request latency")
        .buckets(vec![0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0]),
)?;

// Cache metrics
let cache_hit_rate = Gauge::new("prism_proxy_cache_hit_rate", "Cache hit rate")?;
let hot_tier_accesses = Counter::new("prism_proxy_hot_tier_accesses_total", "Hot tier accesses")?;
let cold_tier_accesses = Counter::new("prism_proxy_cold_tier_accesses_total", "Cold tier accesses")?;

// Backend latency histograms (per-backend breakdown)
let redis_latency = Histogram::with_opts(
    HistogramOpts::new("prism_proxy_redis_latency_seconds", "Redis latency"),
)?;
let postgres_latency = Histogram::with_opts(
    HistogramOpts::new("prism_proxy_postgres_latency_seconds", "PostgreSQL latency"),
)?;
let s3_latency = Histogram::with_opts(
    HistogramOpts::new("prism_proxy_s3_latency_seconds", "S3 latency"),
)?;

// Register all metrics
let registry = Registry::new();
registry.register(Box::new(requests_total.clone()))?;
registry.register(Box::new(latency_histogram.clone()))?;
// ... register all

// Expose /metrics endpoint
let metrics_route = warp::path("metrics").map(move || {
    let encoder = TextEncoder::new();
    let metric_families = registry.gather();
    let mut buffer = vec![];
    encoder.encode(&metric_families, &mut buffer).unwrap();
    String::from_utf8(buffer).unwrap()
});
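
The block above only registers and serves the metrics. A minimal sketch of how the request path might update them; the record_request helper and its arguments are hypothetical, not part of the proxy code:

use std::time::Instant;
use prometheus::{Counter, Gauge, Histogram};

// Hypothetical helper: called once per completed request with the metrics
// registered above and the running hot-tier hit/access counts.
fn record_request(
    requests_total: &Counter,
    latency_histogram: &Histogram,
    cache_hit_rate: &Gauge,
    hot_tier_hits: f64,
    total_accesses: f64,
    start: Instant,
) {
    requests_total.inc();                                      // count the request
    latency_histogram.observe(start.elapsed().as_secs_f64());  // end-to-end latency
    if total_accesses > 0.0 {
        cache_hit_rate.set(hot_tier_hits / total_accesses);    // derived gauge (0-1)
    }
}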

Metrics Exposed:

# Requests
prism_proxy_requests_total # Total requests
prism_proxy_requests_errors_total # Total errors
prism_proxy_requests_duration_seconds # Request latency histogram

# Cache
prism_proxy_cache_hit_rate # Hot tier hit rate (0-1)
prism_proxy_hot_tier_accesses_total # Hot tier accesses
prism_proxy_cold_tier_accesses_total # Cold tier accesses

# Backend latency
prism_proxy_redis_latency_seconds # Redis query time
prism_proxy_postgres_latency_seconds # PostgreSQL query time
prism_proxy_s3_latency_seconds # S3 load time

# Connections
prism_proxy_active_connections # Current active connections
prism_proxy_connection_pool_size # Connection pool size

Node Metrics (from node_exporter):

# CPU
node_cpu_seconds_total # CPU time by mode (idle, system, user)
node_load1, node_load5, node_load15 # Load averages

# Memory
node_memory_MemTotal_bytes # Total memory
node_memory_MemAvailable_bytes # Available memory
node_memory_MemFree_bytes # Free memory

# Network
node_network_receive_bytes_total # Bytes received
node_network_transmit_bytes_total # Bytes transmitted
node_network_receive_packets_total # Packets received
node_network_transmit_packets_total # Packets transmitted

# Disk
node_disk_read_bytes_total # Disk read bytes
node_disk_written_bytes_total # Disk write bytes
node_disk_io_time_seconds_total # Disk I/O time
node_filesystem_avail_bytes # Available disk space

PostgreSQL Metrics (from postgres_exporter):

# Connections
pg_stat_database_numbackends # Active connections

# Replication
pg_stat_replication_replay_lag # Replication lag (bytes)
pg_replication_lag_seconds # Replication lag (seconds)

# Queries
pg_stat_database_xact_commit # Committed transactions
pg_stat_database_xact_rollback # Rolled back transactions
pg_stat_database_blks_read # Blocks read from disk
pg_stat_database_blks_hit # Blocks found in cache

# Locks
pg_locks_count # Lock count by mode

Metrics Storage Capacity

Local Prometheus (per AZ):

Metrics per instance:
- Redis: 50 metrics × 334 instances = 16,700 metrics
- Proxy: 30 metrics × 334 instances = 10,020 metrics
- Node: 100 metrics × 668 instances = 66,800 metrics
- PostgreSQL: 50 metrics × 1 instance = 50 metrics
- Kubernetes: 200 metrics (cluster-wide)
Total per AZ: ~94,000 metrics

Scrape interval: 15 seconds
Data points per day: 94,000 metrics × (86,400 seconds / 15) = 541M data points/day

Storage size (uncompressed): 541M × 16 bytes = 8.7 GB/day
Storage size (compressed 10:1): 8.7 GB / 10 = 870 MB/day
30-day retention: 870 MB × 30 = 26 GB

Actual storage (500 GB allocated): 19× headroom for growth

Global Prometheus:

Federated metrics: ~10,000 (aggregated from 3 AZ instances)
Scrape interval: 60 seconds
Data points per day: 10,000 × (86,400 / 60) = 14.4M data points/day
Storage size (uncompressed): 14.4M × 16 bytes = 230 MB/day
30-day retention: 230 MB × 30 = 6.9 GB (less with 10:1 compression)

Actual storage (500 GB allocated): 72× headroom

Assessment: ✅ Storage capacity sufficient for 30-day retention with significant headroom


Visualization (Grafana)

Grafana Deployment

Kubernetes Deployment (HA pair):

apiVersion: apps/v1
kind: Deployment
metadata:
name: grafana
namespace: prism-observability
spec:
replicas: 2 # HA
selector:
matchLabels:
app: grafana
template:
metadata:
labels:
app: grafana
spec:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- grafana
topologyKey: kubernetes.io/hostname

containers:
- name: grafana
image: grafana/grafana:10.2.2
env:
- name: GF_SECURITY_ADMIN_PASSWORD
valueFrom:
secretKeyRef:
name: grafana-secret
key: admin-password
- name: GF_DATABASE_TYPE
value: postgres
- name: GF_DATABASE_HOST
value: postgres.prism.svc.cluster.local:5432
- name: GF_DATABASE_NAME
value: grafana
- name: GF_DATABASE_USER
valueFrom:
secretKeyRef:
name: grafana-secret
key: db-user
- name: GF_DATABASE_PASSWORD
valueFrom:
secretKeyRef:
name: grafana-secret
key: db-password
- name: GF_AUTH_ANONYMOUS_ENABLED
value: "false"
- name: GF_AUTH_DISABLE_LOGIN_FORM
value: "false"

ports:
- name: web
containerPort: 3000

volumeMounts:
- name: grafana-storage
mountPath: /var/lib/grafana
- name: grafana-dashboards
mountPath: /etc/grafana/provisioning/dashboards
- name: grafana-datasources
mountPath: /etc/grafana/provisioning/datasources

resources:
requests:
cpu: "500m"
memory: "1Gi"
limits:
cpu: "2"
memory: "4Gi"

livenessProbe:
httpGet:
path: /api/health
port: 3000
initialDelaySeconds: 30
periodSeconds: 10

readinessProbe:
httpGet:
path: /api/health
port: 3000
initialDelaySeconds: 10
periodSeconds: 5

volumes:
- name: grafana-storage
persistentVolumeClaim:
claimName: grafana-pvc
- name: grafana-dashboards
configMap:
name: grafana-dashboards
- name: grafana-datasources
configMap:
name: grafana-datasources

Service (exposed via ALB):

apiVersion: v1
kind: Service
metadata:
name: grafana
namespace: prism-observability
annotations:
service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
service.beta.kubernetes.io/aws-load-balancer-internal: "true"
spec:
type: LoadBalancer
selector:
app: grafana
ports:
- port: 80
targetPort: 3000
protocol: TCP

Datasource Configuration

Prometheus Datasource:

# grafana-datasources.yaml
apiVersion: 1

datasources:
- name: Prometheus-Global
type: prometheus
access: proxy
url: http://prometheus-global-primary.prism.svc.cluster.local:9090
isDefault: true
jsonData:
timeInterval: "15s"
queryTimeout: "60s"
httpMethod: POST

- name: Prometheus-US-West-2a
type: prometheus
access: proxy
url: http://prometheus-local-us-west-2a.prism.svc.cluster.local:9090
jsonData:
timeInterval: "15s"

- name: Prometheus-US-West-2b
type: prometheus
access: proxy
url: http://prometheus-local-us-west-2b.prism.svc.cluster.local:9090
jsonData:
timeInterval: "15s"

- name: Prometheus-US-West-2c
type: prometheus
access: proxy
url: http://prometheus-local-us-west-2c.prism.svc.cluster.local:9090
jsonData:
timeInterval: "15s"

- name: Jaeger
type: jaeger
access: proxy
url: http://jaeger-query.prism-observability.svc.cluster.local:16686
jsonData:
tracesToLogs:
datasourceUid: loki
tags: ['trace_id']

- name: Loki
type: loki
access: proxy
url: http://loki-gateway.prism-observability.svc.cluster.local:3100
jsonData:
maxLines: 1000
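
The Deployment above mounts a grafana-dashboards ConfigMap, but the dashboard provider file itself is not shown. A minimal provisioning sketch, assuming dashboards are shipped as JSON files in that ConfigMap (provider name and folder are assumptions):

# grafana-dashboards provider (provisioning/dashboards/provider.yaml)
apiVersion: 1

providers:
  - name: prism-dashboards
    orgId: 1
    folder: Prism
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /etc/grafana/provisioning/dashboards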

Grafana Dashboards

Dashboard 1: Infrastructure Overview

{
"dashboard": {
"title": "Prism Infrastructure Overview",
"rows": [
{
"title": "Compute Resources",
"panels": [
{
"title": "CPU Utilization (%)",
"targets": [
{
"expr": "100 - (avg(irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"legendFormat": "Average CPU"
}
],
"type": "graph"
},
{
"title": "Memory Utilization (%)",
"targets": [
{
"expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
"legendFormat": "{{instance}}"
}
],
"type": "graph"
},
{
"title": "Network Throughput (Gbps)",
"targets": [
{
"expr": "sum(rate(node_network_transmit_bytes_total[5m])) * 8 / 1e9",
"legendFormat": "Transmit"
},
{
"expr": "sum(rate(node_network_receive_bytes_total[5m])) * 8 / 1e9",
"legendFormat": "Receive"
}
],
"type": "graph"
}
]
},
{
"title": "Instance Health",
"panels": [
{
"title": "Redis Instances (Up/Down)",
"targets": [
{
"expr": "count(up{job=\"redis\"} == 1)",
"legendFormat": "Up"
},
{
"expr": "count(up{job=\"redis\"} == 0)",
"legendFormat": "Down"
}
],
"type": "singlestat"
},
{
"title": "Proxy Instances (Up/Down)",
"targets": [
{
"expr": "count(up{job=\"proxy\"} == 1)",
"legendFormat": "Up"
},
{
"expr": "count(up{job=\"proxy\"} == 0)",
"legendFormat": "Down"
}
],
"type": "singlestat"
}
]
}
]
}
}

Dashboard 2: Redis Performance

Key panels (example queries below):

  • Operations per second (instantaneous)
  • Command latency (p50, p95, p99)
  • Memory usage (used, max, fragmentation)
  • Eviction rate (keys evicted/sec)
  • Cache hit rate (hits / (hits + misses))
  • Replication lag (seconds behind master)
  • Cluster health (slots assigned, state)
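
Example PromQL behind several of these panels, a sketch built on the exporter metrics listed earlier:

# Operations per second (per instance)
sum(rate(redis_commands_processed_total[5m])) by (instance)

# Cache hit rate
sum(rate(redis_keyspace_hits_total[5m])) /
(sum(rate(redis_keyspace_hits_total[5m])) + sum(rate(redis_keyspace_misses_total[5m])))

# Eviction rate (keys/sec)
sum(rate(redis_evicted_keys_total[5m])) by (instance)

# Memory utilization (fraction of maxmemory)
redis_memory_used_bytes / redis_memory_max_bytes

# Replication lag (seconds)
max(redis_replication_lag_seconds) by (instance)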

Dashboard 3: Proxy Performance

Key panels (example queries below):

  • Requests per second
  • Request latency (p50, p95, p99) by operation type
  • Error rate (errors/sec, % of total)
  • Cache hit rate (hot tier hit %)
  • Backend latency breakdown (Redis, PostgreSQL, S3)
  • Active connections
  • Connection pool utilization
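
Example PromQL for the error-rate, backend-breakdown, and pool panels, a sketch using the proxy metrics defined above:

# Error rate (% of requests)
100 * sum(rate(prism_proxy_requests_errors_total[5m]))
    / sum(rate(prism_proxy_requests_total[5m]))

# Backend latency breakdown (p95 per backend)
histogram_quantile(0.95, sum(rate(prism_proxy_redis_latency_seconds_bucket[5m])) by (le))
histogram_quantile(0.95, sum(rate(prism_proxy_postgres_latency_seconds_bucket[5m])) by (le))
histogram_quantile(0.95, sum(rate(prism_proxy_s3_latency_seconds_bucket[5m])) by (le))

# Connection pool utilization
prism_proxy_active_connections / prism_proxy_connection_pool_size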

Dashboard 4: Network Topology

Key panels:

  • Cross-AZ traffic (% of total traffic)
  • Bandwidth utilization by AZ
  • Packet loss rate
  • Cross-AZ latency (average, p95, p99)
  • Data transfer costs (estimated monthly)

Dashboard 5: Cost Tracking

Key panels:

  • Instance hours by type (r6i.4xlarge, c6i.2xlarge)
  • Data transfer (intra-AZ, cross-AZ, internet)
  • Storage utilization (EBS, S3)
  • Estimated monthly cost (running total)

Dashboard Query Performance

Query Example (proxy latency p99):

histogram_quantile(0.99,
sum(rate(prism_proxy_requests_duration_seconds_bucket[5m])) by (le)
)

Query Execution Time (from MEMO-074 benchmarks):

  • 30-day retention: <500ms
  • 7-day retention: <200ms
  • Real-time (last 5 minutes): <50ms

Assessment: ✅ Dashboard queries performant for operational use


Distributed Tracing (Jaeger)

Jaeger Architecture

Deployment Strategy: All-in-one image for development; production runs separate agent, collector, and query components with a Cassandra backend

Jaeger Architecture (Production):

Client (Proxy Nodes)
↓ UDP 6831 (jaeger.thrift compact)
Jaeger Agent (DaemonSet on each node)
↓ gRPC
Jaeger Collector (replicas: 3)
↓ Write spans
Cassandra Cluster (3 nodes, RF=3)
↑ Read spans
Jaeger Query Service (replicas: 2)
↑ HTTP 16686
Grafana Explore

Benefits:

  • ✅ Low-latency span submission (UDP to local agent)
  • ✅ Buffering at collector (handles burst traffic)
  • ✅ Scalable storage (Cassandra horizontal scaling)
  • ✅ High availability (3 collectors, 2 query services)

Jaeger Deployment

Jaeger Agent (DaemonSet):

apiVersion: apps/v1
kind: DaemonSet
metadata:
name: jaeger-agent
namespace: prism-observability
spec:
selector:
matchLabels:
app: jaeger-agent
template:
metadata:
labels:
app: jaeger-agent
spec:
hostNetwork: true
containers:
- name: jaeger-agent
image: jaegertracing/jaeger-agent:1.51.0
args:
- --reporter.grpc.host-port=jaeger-collector.prism-observability.svc.cluster.local:14250
- --reporter.grpc.retry.max=10
ports:
- name: compact
containerPort: 6831
protocol: UDP
- name: binary
containerPort: 6832
protocol: UDP
- name: admin
containerPort: 14271
protocol: TCP
resources:
requests:
cpu: "100m"
memory: "128Mi"
limits:
cpu: "500m"
memory: "512Mi"

Jaeger Collector:

apiVersion: apps/v1
kind: Deployment
metadata:
name: jaeger-collector
namespace: prism-observability
spec:
replicas: 3
selector:
matchLabels:
app: jaeger-collector
template:
metadata:
labels:
app: jaeger-collector
spec:
containers:
- name: jaeger-collector
image: jaegertracing/jaeger-collector:1.51.0
args:
- --cassandra.keyspace=jaeger_v1_dc1
- --cassandra.servers=cassandra.prism-observability.svc.cluster.local
- --cassandra.username=jaeger
- --cassandra.password=$(CASSANDRA_PASSWORD)
- --collector.zipkin.host-port=:9411
- --collector.num-workers=50
- --collector.queue-size=10000
env:
- name: CASSANDRA_PASSWORD
valueFrom:
secretKeyRef:
name: jaeger-cassandra-secret
key: password
ports:
- name: grpc
containerPort: 14250
- name: http
containerPort: 14268
- name: zipkin
containerPort: 9411
- name: admin
containerPort: 14269
resources:
requests:
cpu: "1"
memory: "2Gi"
limits:
cpu: "4"
memory: "8Gi"

Jaeger Query Service:

apiVersion: apps/v1
kind: Deployment
metadata:
name: jaeger-query
namespace: prism-observability
spec:
replicas: 2
selector:
matchLabels:
app: jaeger-query
template:
metadata:
labels:
app: jaeger-query
spec:
containers:
- name: jaeger-query
image: jaegertracing/jaeger-query:1.51.0
args:
- --cassandra.keyspace=jaeger_v1_dc1
- --cassandra.servers=cassandra.prism-observability.svc.cluster.local
- --cassandra.username=jaeger
- --cassandra.password=$(CASSANDRA_PASSWORD)
env:
- name: CASSANDRA_PASSWORD
valueFrom:
secretKeyRef:
name: jaeger-cassandra-secret
key: password
ports:
- name: query
containerPort: 16686
- name: admin
containerPort: 16687
resources:
requests:
cpu: "500m"
memory: "1Gi"
limits:
cpu: "2"
memory: "4Gi"

Cassandra Backend (for Jaeger)

StatefulSet (3 nodes, replication factor 3):

apiVersion: apps/v1
kind: StatefulSet
metadata:
name: cassandra
namespace: prism-observability
spec:
serviceName: cassandra
replicas: 3
selector:
matchLabels:
app: cassandra
template:
metadata:
labels:
app: cassandra
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- cassandra
topologyKey: kubernetes.io/hostname

containers:
- name: cassandra
image: cassandra:4.1
env:
- name: CASSANDRA_CLUSTER_NAME
value: "prism-jaeger"
- name: CASSANDRA_DC
value: "DC1"
- name: CASSANDRA_RACK
value: "Rack1"
- name: CASSANDRA_SEEDS
value: "cassandra-0.cassandra.prism-observability.svc.cluster.local"
ports:
- name: cql
containerPort: 9042
- name: gossip
containerPort: 7000
volumeMounts:
- name: cassandra-data
mountPath: /var/lib/cassandra
resources:
requests:
cpu: "2"
memory: "8Gi"
limits:
cpu: "4"
memory: "16Gi"

volumeClaimTemplates:
- metadata:
name: cassandra-data
spec:
accessModes: ["ReadWriteOnce"]
storageClassName: gp3
resources:
requests:
storage: 500Gi

Cassandra Schema (initialized via CQL):

CREATE KEYSPACE IF NOT EXISTS jaeger_v1_dc1
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

USE jaeger_v1_dc1;

CREATE TABLE IF NOT EXISTS traces (
trace_id blob,
span_id bigint,
span_hash bigint,
parent_id bigint,
operation_name text,
flags int,
start_time timestamp,
duration bigint,
tags list<frozen<tag>>,
logs list<frozen<log>>,
refs list<frozen<span_ref>>,
process frozen<process>,
PRIMARY KEY (trace_id, span_id, span_hash)
) WITH compaction = {'class': 'TimeWindowCompactionStrategy', 'compaction_window_size': 1, 'compaction_window_unit': 'HOURS'};

-- Indexes for efficient querying
CREATE INDEX IF NOT EXISTS traces_start_time_idx ON traces (start_time);
CREATE INDEX IF NOT EXISTS traces_operation_idx ON traces (operation_name);

Trace Sampling Strategy

Sampling Configuration:

# Sampling per MEMO-074 analysis: 1% sampling for 1.1B ops/sec = 11M spans/sec
apiVersion: v1
kind: ConfigMap
metadata:
name: jaeger-sampling-config
namespace: prism-observability
data:
sampling.json: |
{
"default_strategy": {
"type": "probabilistic",
"param": 0.01
},
"per_operation_strategies": {
"prism-proxy": [
{
"operation": "GetVertex",
"type": "probabilistic",
"param": 0.01
},
{
"operation": "GetEdges",
"type": "probabilistic",
"param": 0.01
},
{
"operation": "TraverseGraph",
"type": "probabilistic",
"param": 0.1
},
{
"operation": "HealthCheck",
"type": "probabilistic",
"param": 0.0001
}
]
}
}
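
The collector has to be pointed at this file. A hedged fragment of the jaeger-collector spec shown earlier (mount path and volume name are assumptions consistent with the ConfigMap above):

# Added to the jaeger-collector container and pod spec
args:
  - --sampling.strategies-file=/etc/jaeger/sampling.json
volumeMounts:
  - name: sampling-config
    mountPath: /etc/jaeger
volumes:
  - name: sampling-config
    configMap:
      name: jaeger-sampling-config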

Sampling Rationale:

  • 1% default: 11M spans/sec (manageable by Cassandra)
  • 10% for complex queries (traversals): Higher sampling for operations we care most about
  • 0.01% for health checks: Reduce noise from high-frequency low-value operations

Span Volume:

Expected spans per day:
- GetVertex (70% of traffic): 1.1B ops/sec × 0.7 × 0.01 = 7.7M spans/sec
- GetEdges (20%): 1.1B × 0.2 × 0.01 = 2.2M spans/sec
- TraverseGraph (10%): 1.1B × 0.1 × 0.1 = 11M spans/sec
Total: 21M spans/sec

Daily: 21M spans/sec × 86,400 = 1.8 trillion spans/day

Average span size: 1 KB
Daily storage: 1.8T × 1 KB = 1.8 TB/day

7-day retention: 1.8 TB × 7 = 12.6 TB
Cassandra compression (5:1): 12.6 TB / 5 = 2.5 TB

Allocated storage (3 nodes × 500 GB × 3 RF): 4.5 TB
Utilization: 2.5 TB / 4.5 TB = 56%

Assessment: ✅ Storage capacity sufficient for 7-day trace retention


Trace Instrumentation (Proxy)

Rust OpenTelemetry Integration:

use opentelemetry::sdk::trace::{self, Sampler};
use opentelemetry_jaeger::new_pipeline;
use tracing::Instrument;
use tracing_opentelemetry::OpenTelemetryLayer;
use tracing_subscriber::{layer::SubscriberExt, Registry};

// Initialize Jaeger tracer (1% TraceIdRatioBased sampling, batched export via Tokio)
let tracer = new_pipeline()
    .with_service_name("prism-proxy")
    .with_agent_endpoint("localhost:6831")
    .with_trace_config(trace::config().with_sampler(Sampler::TraceIdRatioBased(0.01)))
    .install_batch(opentelemetry::runtime::Tokio)?;

// Set up tracing subscriber with the OpenTelemetry layer
let telemetry = OpenTelemetryLayer::new(tracer);
let subscriber = Registry::default().with(telemetry);
tracing::subscriber::set_global_default(subscriber)?;

// Instrument the handler: the attribute creates the parent `get_vertex` span,
// and each backend call is wrapped with `.instrument()` so its child span is
// entered correctly across await points.
#[tracing::instrument(skip(self), fields(vertex.id = %vertex_id))]
async fn get_vertex(&self, vertex_id: &str) -> Result<Vertex, Error> {
    // 1. Query PostgreSQL for partition metadata
    let partition_id = self
        .postgres
        .get_partition(vertex_id)
        .instrument(tracing::info_span!("query_partition_metadata"))
        .await?;

    // 2. Check hot tier (Redis)
    let redis_span = tracing::info_span!("redis_get", partition = %partition_id);
    if let Some(vertex) = self.redis.get(vertex_id).instrument(redis_span).await? {
        return Ok(vertex);
    }

    // 3. Load from cold tier (S3)
    let s3_span = tracing::info_span!("s3_load_partition", partition = %partition_id);
    let partition = self.s3.load_partition(partition_id).instrument(s3_span).await?;

    // 4. Promote to hot tier
    self.redis
        .set(vertex_id, &partition.get_vertex(vertex_id))
        .instrument(tracing::info_span!("promote_to_hot_tier"))
        .await?;

    Ok(partition.get_vertex(vertex_id))
}

Trace Example (GetVertex span):

{
"traceID": "5a2d3f8b4c1e6a7b",
"spanID": "1234567890abcdef",
"operationName": "get_vertex",
"startTime": 1700000000000000,
"duration": 45000,
"tags": [
{"key": "vertex.id", "value": "user:123456"},
{"key": "partition.id", "value": "42"},
{"key": "cache.hit", "value": "false"},
{"key": "tier", "value": "cold"}
],
"logs": [],
"references": [],
"process": {
"serviceName": "prism-proxy",
"tags": [
{"key": "hostname", "value": "proxy-node-123"},
{"key": "ip", "value": "10.0.10.50"}
]
},
"children": [
{
"spanID": "2345678901bcdef0",
"operationName": "query_partition_metadata",
"duration": 2000,
"tags": [{"key": "db.type", "value": "postgresql"}]
},
{
"spanID": "3456789012cdef01",
"operationName": "redis_get",
"duration": 800,
"tags": [
{"key": "db.type", "value": "redis"},
{"key": "cache.hit", "value": "false"}
]
},
{
"spanID": "456789013def0123",
"operationName": "s3_load_partition",
"duration": 35000,
"tags": [
{"key": "storage.type", "value": "s3"},
{"key": "partition.size", "value": "100MB"}
]
},
{
"spanID": "56789014ef012345",
"operationName": "promote_to_hot_tier",
"duration": 1200,
"tags": [{"key": "db.type", "value": "redis"}]
}
]
}

Trace Query (in Jaeger UI):

  • Service: prism-proxy
  • Operation: get_vertex
  • Tags: cache.hit=false, tier=cold
  • Duration: >40ms
  • Result: Shows all cold tier accesses taking >40ms

Logging (Loki)

Loki Architecture

Deployment Strategy: Microservices mode with S3 backend

Loki Architecture:

Proxy Nodes / Redis / PostgreSQL
↓ HTTP (push logs)
Loki Distributor (replicas: 3)
↓ Write to S3
S3 Bucket: prism-logs
↑ Read from S3
Loki Querier (replicas: 2)
↑ Query logs
Grafana Explore

Loki Deployment

Loki Configuration:

auth_enabled: false

server:
http_listen_port: 3100
grpc_listen_port: 9096

common:
path_prefix: /loki
storage:
s3:
s3: s3://us-west-2/prism-logs
s3forcepathstyle: true
replication_factor: 1

schema_config:
configs:
- from: 2024-01-01
store: tsdb
object_store: s3
schema: v12
index:
prefix: index_
period: 24h

storage_config:
tsdb_shipper:
active_index_directory: /loki/tsdb-index
cache_location: /loki/tsdb-cache
shared_store: s3

aws:
s3: s3://us-west-2/prism-logs
region: us-west-2

compactor:
working_directory: /loki/compactor
shared_store: s3
compaction_interval: 10m

limits_config:
retention_period: 168h # 7 days
ingestion_rate_mb: 10
ingestion_burst_size_mb: 20
max_query_length: 720h
max_query_lookback: 720h

chunk_store_config:
max_look_back_period: 168h

table_manager:
retention_deletes_enabled: true
retention_period: 168h

Distributor Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
name: loki-distributor
namespace: prism-observability
spec:
replicas: 3
selector:
matchLabels:
app: loki
component: distributor
template:
metadata:
labels:
app: loki
component: distributor
spec:
containers:
- name: loki
image: grafana/loki:2.9.3
args:
- -config.file=/etc/loki/config.yaml
- -target=distributor
ports:
- name: http
containerPort: 3100
- name: grpc
containerPort: 9096
volumeMounts:
- name: config
mountPath: /etc/loki
resources:
requests:
cpu: "500m"
memory: "1Gi"
limits:
cpu: "2"
memory: "4Gi"
volumes:
- name: config
configMap:
name: loki-config

Querier Deployment (similar structure, -target=querier).


Log Collection (Promtail)

Promtail DaemonSet (collects logs from all nodes):

apiVersion: apps/v1
kind: DaemonSet
metadata:
name: promtail
namespace: prism-observability
spec:
selector:
matchLabels:
app: promtail
template:
metadata:
labels:
app: promtail
spec:
serviceAccountName: promtail
containers:
- name: promtail
image: grafana/promtail:2.9.3
args:
- -config.file=/etc/promtail/config.yaml
volumeMounts:
- name: config
mountPath: /etc/promtail
- name: varlog
mountPath: /var/log
readOnly: true
- name: varlibdockercontainers
mountPath: /var/lib/docker/containers
readOnly: true
env:
- name: HOSTNAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
resources:
requests:
cpu: "100m"
memory: "128Mi"
limits:
cpu: "500m"
memory: "512Mi"
volumes:
- name: config
configMap:
name: promtail-config
- name: varlog
hostPath:
path: /var/log
- name: varlibdockercontainers
hostPath:
path: /var/lib/docker/containers

Promtail Configuration:

server:
http_listen_port: 9080
grpc_listen_port: 0

positions:
filename: /tmp/positions.yaml

clients:
- url: http://loki-distributor.prism-observability.svc.cluster.local:3100/loki/api/v1/push

scrape_configs:
# Kubernetes pod logs
- job_name: kubernetes-pods
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_node_name]
target_label: node
- source_labels: [__meta_kubernetes_namespace]
target_label: namespace
- source_labels: [__meta_kubernetes_pod_name]
target_label: pod
- source_labels: [__meta_kubernetes_pod_container_name]
target_label: container
pipeline_stages:
- docker: {}
- json:
expressions:
level: level
msg: msg
trace_id: trace_id
latency_ms: latency_ms
- labels:
level:
trace_id:

# System logs
- job_name: system
static_configs:
- targets:
- localhost
labels:
job: syslog
__path__: /var/log/syslog
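
To enforce the no-PII requirement at the edge, a scrub stage can be added to the pod-log pipeline above. A sketch; the regex below only redacts email-like strings and is illustrative, not a complete PII filter:

pipeline_stages:
  - docker: {}
  - replace:
      # Redact email-like strings before logs leave the node
      expression: '([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})'
      replace: '<redacted>'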

Structured Logging (Proxy)

Rust Logging Setup:

use std::time::Instant;
use tracing::{error, info};
use tracing_subscriber::fmt::format::FmtSpan;

// Initialize structured logging (JSON format)
tracing_subscriber::fmt()
    .json()
    .with_max_level(tracing::Level::INFO)
    .with_span_events(FmtSpan::CLOSE)
    .init();

// Example log statements around request handling
#[tracing::instrument(skip(self))]
async fn handle_request(&self, req: Request) -> Result<Response, Error> {
    info!(
        request.id = %req.id,
        request.method = %req.method,
        request.path = %req.path,
        "Received request"
    );

    // Keep the request ID for the completion log; `req` is moved below.
    let request_id = req.id.clone();
    let start = Instant::now();
    let result = self.process_request(req).await;
    let latency_ms = start.elapsed().as_millis();

    match result {
        Ok(resp) => {
            info!(
                request.id = %request_id,
                response.status = %resp.status,
                latency_ms = %latency_ms,
                "Request completed successfully"
            );
            Ok(resp)
        }
        Err(e) => {
            error!(
                request.id = %request_id,
                error = %e,
                latency_ms = %latency_ms,
                "Request failed"
            );
            Err(e)
        }
    }
}

Log Output (JSON):

{
"timestamp": "2025-11-16T12:00:00.123456Z",
"level": "INFO",
"target": "prism_proxy",
"fields": {
"message": "Request completed successfully",
"request.id": "req-abc123",
"response.status": 200,
"latency_ms": 3,
"trace_id": "5a2d3f8b4c1e6a7b",
"span.name": "handle_request",
"span.id": "1234567890abcdef"
}
}
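
With trace_id present in every log line, logs can be correlated with traces from Grafana Explore. Example LogQL; label names follow the Promtail relabeling above, and the container value is an assumption:

# All proxy errors in the prism namespace
{namespace="prism", container="prism-proxy"} | json | level="ERROR"

# Logs for a single trace (same trace ID as the Jaeger example above)
{namespace="prism", container="prism-proxy"} |= "5a2d3f8b4c1e6a7b"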

Log Retention and Costs

Log Volume:

Log generation:
- Proxy nodes: 1000 × 100 logs/sec = 100,000 logs/sec
- Redis: 1000 × 10 logs/sec = 10,000 logs/sec
- PostgreSQL: 4 × 50 logs/sec = 200 logs/sec
Total: ~110,000 logs/sec

Average log size: 500 bytes (JSON)
Daily volume: 110,000 × 500 bytes × 86,400 = 4.75 GB/day
Weekly volume (7 days): 4.75 GB × 7 = 33.25 GB

Compression (5:1): 33.25 GB / 5 = 6.65 GB
S3 storage cost: 6.65 GB × $0.023/GB = $0.15/month

Assessment: ✅ Negligible storage cost for logs

Alerting (Alertmanager)

Alertmanager Deployment

apiVersion: apps/v1
kind: StatefulSet
metadata:
name: alertmanager
namespace: prism-observability
spec:
serviceName: alertmanager
replicas: 2
selector:
matchLabels:
app: alertmanager
template:
metadata:
labels:
app: alertmanager
spec:
containers:
- name: alertmanager
image: prom/alertmanager:v0.26.0
args:
- --config.file=/etc/alertmanager/config.yml
- --storage.path=/alertmanager
- --cluster.listen-address=0.0.0.0:9094
- --cluster.peer=alertmanager-0.alertmanager.prism-observability.svc.cluster.local:9094
- --cluster.peer=alertmanager-1.alertmanager.prism-observability.svc.cluster.local:9094
ports:
- name: web
containerPort: 9093
- name: cluster
containerPort: 9094
volumeMounts:
- name: config
mountPath: /etc/alertmanager
- name: storage
mountPath: /alertmanager
resources:
requests:
cpu: "100m"
memory: "128Mi"
limits:
cpu: "500m"
memory: "512Mi"
volumes:
- name: config
configMap:
name: alertmanager-config
- name: storage
emptyDir: {}

Alertmanager Configuration

global:
resolve_timeout: 5m
slack_api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXX'
pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

route:
receiver: 'slack-default'
group_by: ['alertname', 'cluster', 'service']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h

routes:
# Critical alerts → PagerDuty
- match:
severity: critical
receiver: 'pagerduty-critical'
continue: true

# Warning alerts → Slack
- match:
severity: warning
receiver: 'slack-warnings'

# Info alerts → Email
- match:
severity: info
receiver: 'email-info'

receivers:
- name: 'slack-default'
slack_configs:
- channel: '#prism-alerts'
title: '{{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
send_resolved: true

- name: 'slack-warnings'
slack_configs:
- channel: '#prism-warnings'
title: '{{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
send_resolved: true

- name: 'pagerduty-critical'
pagerduty_configs:
- service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
description: '{{ .GroupLabels.alertname }}: {{ .GroupLabels.instance }}'
severity: 'critical'
details:
firing: '{{ .Alerts.Firing | len }}'
resolved: '{{ .Alerts.Resolved | len }}'

- name: 'email-info'
email_configs:
- to: 'prism-alerts@example.com'
from: 'alertmanager@example.com'
smarthost: 'smtp.example.com:587'
auth_username: 'alertmanager'
auth_password: 'password'
headers:
Subject: 'Prism Alert: {{ .GroupLabels.alertname }}'
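
Routing can be verified before wiring real incidents. A sketch using amtool; URLs and label values are illustrative:

# Which receiver would a critical alert hit?
amtool config routes test --config.file=/etc/alertmanager/config.yml severity=critical

# Fire a synthetic alert end-to-end (PagerDuty/Slack delivery check)
amtool alert add --alertmanager.url=http://alertmanager.prism-observability.svc.cluster.local:9093 \
  alertname=ObservabilityStackTest severity=critical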

Alert Rules

Redis Alerts (prometheus-rules/redis.yml):

groups:
- name: redis-alerts
interval: 30s
rules:
- alert: RedisDown
expr: up{job="redis"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Redis instance down"
description: "Redis instance {{ $labels.instance }} is down for more than 1 minute"
runbook: "https://runbooks.example.com/redis-down"

- alert: RedisMemoryHigh
expr: (redis_memory_used_bytes / redis_memory_max_bytes) > 0.9
for: 5m
labels:
severity: warning
annotations:
summary: "Redis memory usage high"
description: "Redis instance {{ $labels.instance }} memory usage is {{ $value | humanizePercentage }}"
runbook: "https://runbooks.example.com/redis-memory"

- alert: RedisEvictionRate
expr: rate(redis_evicted_keys_total[5m]) > 100
for: 5m
labels:
severity: warning
annotations:
summary: "Redis eviction rate high"
description: "Redis instance {{ $labels.instance }} evicting {{ $value }} keys/sec"
runbook: "https://runbooks.example.com/redis-evictions"

- alert: RedisReplicationLag
expr: redis_replication_lag_seconds > 10
for: 2m
labels:
severity: critical
annotations:
summary: "Redis replication lag high"
description: "Redis replica {{ $labels.instance }} lagging {{ $value }}s behind master"
runbook: "https://runbooks.example.com/redis-replication-lag"

- alert: RedisClusterDown
expr: redis_cluster_state == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Redis cluster unhealthy"
description: "Redis cluster {{ $labels.cluster }} is in failed state"
runbook: "https://runbooks.example.com/redis-cluster-down"

Proxy Alerts (prometheus-rules/proxy.yml):

groups:
- name: proxy-alerts
interval: 30s
rules:
- alert: ProxyHighErrorRate
expr: (rate(prism_proxy_requests_errors_total[5m]) / rate(prism_proxy_requests_total[5m])) > 0.01
for: 2m
labels:
severity: critical
annotations:
summary: "Proxy error rate high"
description: "Proxy {{ $labels.instance }} error rate is {{ $value | humanizePercentage }}"
runbook: "https://runbooks.example.com/proxy-errors"

- alert: ProxyHighLatency
expr: histogram_quantile(0.99, rate(prism_proxy_requests_duration_seconds_bucket[5m])) > 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "Proxy p99 latency high"
description: "Proxy {{ $labels.instance }} p99 latency is {{ $value }}s"
runbook: "https://runbooks.example.com/proxy-latency"

- alert: ProxyCacheMissRate
expr: prism_proxy_cache_hit_rate < 0.8
for: 10m
labels:
severity: warning
annotations:
summary: "Proxy cache hit rate low"
description: "Proxy {{ $labels.instance }} cache hit rate is {{ $value | humanizePercentage }}"
runbook: "https://runbooks.example.com/cache-miss"

- alert: ProxyConnectionPoolExhausted
expr: prism_proxy_active_connections / prism_proxy_connection_pool_size > 0.9
for: 5m
labels:
severity: critical
annotations:
summary: "Proxy connection pool nearly exhausted"
description: "Proxy {{ $labels.instance }} using {{ $value | humanizePercentage }} of connection pool"
runbook: "https://runbooks.example.com/connection-pool"

Infrastructure Alerts (prometheus-rules/infrastructure.yml):

groups:
- name: infrastructure-alerts
interval: 30s
rules:
- alert: HighCPUUsage
expr: (100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
for: 10m
labels:
severity: warning
annotations:
summary: "High CPU usage"
description: "Instance {{ $labels.instance }} CPU usage is {{ $value }}%"
runbook: "https://runbooks.example.com/high-cpu"

- alert: HighMemoryUsage
expr: ((1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100) > 90
for: 5m
labels:
severity: critical
annotations:
summary: "High memory usage"
description: "Instance {{ $labels.instance }} memory usage is {{ $value }}%"
runbook: "https://runbooks.example.com/high-memory"

- alert: DiskSpaceLow
expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "Disk space low"
description: "Instance {{ $labels.instance }} disk {{ $labels.mountpoint }} has {{ $value | humanizePercentage }} free"
runbook: "https://runbooks.example.com/disk-space"

- alert: NetworkThroughputHigh
expr: (rate(node_network_transmit_bytes_total[5m]) * 8 / 1e9) > 8
for: 10m
labels:
severity: warning
annotations:
summary: "Network throughput approaching limit"
description: "Instance {{ $labels.instance }} network throughput is {{ $value }} Gbps (80% of 10 Gbps)"
runbook: "https://runbooks.example.com/network-saturation"

Observability Costs

Monthly Cost Summary

Component                     Cost/month   Notes
Prometheus (self-hosted)      $1,854       5 instances (3 local + 2 global) × c6i.xlarge ($0.17/hour reserved) × 730 hours + 2.5 TB EBS ($0.08/GB)
Grafana (self-hosted)         $248         2 instances × c6i.large ($0.085/hour reserved) × 730 hours + 100 GB EBS
Jaeger                        $1,854       3 collectors + 2 query × c6i.large + 3 agents (DaemonSet, no cost)
Cassandra (Jaeger backend)    $1,112       3 nodes × c6i.2xlarge ($0.17/hour reserved) × 730 hours + 1.5 TB EBS
Loki                          $372         3 distributors + 2 queriers × t3.medium ($0.0416/hour on-demand) × 730 hours
S3 storage                    $407         Thanos (200 GB @ $0.023/GB) + Loki logs (33 GB) + Jaeger overflow
Total                         $5,847       vs $35,502 CloudWatch-only (MEMO-077), 84% reduction

Cost Breakdown Rationale:

  • ✅ Self-hosted Prometheus: $30K/month savings vs CloudWatch Metrics (100K custom metrics)
  • ✅ S3 for long-term storage: 96% cheaper than CloudWatch retention
  • ✅ Cassandra for traces: Cheaper than managed tracing (AWS X-Ray at $5/1M spans = $105K/month for 21M spans/sec)
  • ⚠️ Operational overhead: Requires SRE team to manage observability stack

Total Infrastructure + Observability Costs:

  • Infrastructure (MEMO-077): $938,757/month
  • Observability: $5,847/month
  • Total: $944,604/month ($11.3M/year, $34.0M over 3 years)
  • vs MEMO-076 baseline ($32.4M): 5% higher, acceptable for production observability

Recommendations

Primary Recommendation

Deploy self-hosted observability stack with:

  1. Prometheus HA cluster (3 local + 2 global instances, 30-day retention)
  2. Grafana (2 instances for HA, PostgreSQL backend for dashboard persistence)
  3. Jaeger with Cassandra (3 collectors, 2 query services, 7-day trace retention, 1% sampling)
  4. Loki with S3 backend (7-day log retention, structured JSON logs)
  5. Alertmanager (2 instances for HA, PagerDuty + Slack + email receivers)
  6. Thanos (long-term metrics storage in S3, 1-year retention)

Monthly Cost: $5,847 (84% cheaper than CloudWatch-only approach)

Operational Trade-off: Requires SRE team to manage observability infrastructure, but provides:

  • Full control over sampling, retention, costs
  • No vendor lock-in
  • Integration with existing tools (Grafana, Jaeger)
  • Significantly lower costs at scale

Dashboard Priorities

Week 1 (production launch):

  1. Infrastructure overview (compute, memory, network)
  2. Redis performance (ops/sec, latency, memory)
  3. Proxy performance (requests/sec, latency, errors)

Week 2 (operational maturity):

  4. Network topology (cross-AZ traffic, bandwidth)
  5. Cost tracking (instance hours, data transfer)

Week 3 (deep observability):

  6. Distributed tracing integration (Jaeger in Grafana Explore)
  7. Log correlation (Loki logs linked to traces)


Alert Tuning Strategy

Phase 1: Conservative (first 30 days):

  • High thresholds to avoid alert fatigue
  • All critical alerts routed to on-call
  • Daily alert review meetings

Phase 2: Calibration (30-90 days):

  • Adjust thresholds based on observed baselines
  • Add warning alerts for leading indicators
  • Tune group_wait and group_interval

Phase 3: Mature (90+ days):

  • Fine-grained alerts with context
  • Auto-remediation for common issues
  • Runbooks tested and updated

Next Steps

Week 19: Development Tooling and CI/CD Pipelines

Focus: Build and deployment automation for continuous delivery

Tasks:

  1. CI/CD pipeline design (GitHub Actions, GitLab CI, or Jenkins)
  2. Docker image builds (Rust proxy, Redis with custom config)
  3. Terraform pipeline (plan, apply, destroy workflows)
  4. Kubernetes manifests management (Helm charts, Kustomize)
  5. Automated testing integration (unit, integration, load tests)

Success Criteria:

  • Automated deployments from Git commits
  • Infrastructure changes reviewed and approved before apply
  • Rollback capability within 5 minutes
  • Blue/green deployment strategy for proxy updates

Appendices

Appendix A: Prometheus Query Examples

Redis Operations per Second:

sum(rate(redis_commands_processed_total[5m])) by (instance)

Proxy p99 Latency:

histogram_quantile(0.99,
sum(rate(prism_proxy_requests_duration_seconds_bucket[5m])) by (le)
)

Cache Hit Rate:

sum(rate(redis_keyspace_hits_total[5m])) /
(sum(rate(redis_keyspace_hits_total[5m])) + sum(rate(redis_keyspace_misses_total[5m])))

Cross-AZ Traffic Percentage:

(sum(rate(node_network_transmit_bytes_total{az!="us-west-2a"}[5m])) /
sum(rate(node_network_transmit_bytes_total[5m]))) * 100

Appendix B: Runbook Template

Title: Redis Instance Down

Severity: Critical

Symptoms:

  • Alert: RedisDown firing
  • Prometheus target redis:9121 unreachable
  • Graph queries returning errors

Investigation:

  1. Check instance health: aws ec2 describe-instance-status --instance-id i-xxxxx
  2. SSH to instance (or use Systems Manager): aws ssm start-session --target i-xxxxx
  3. Check Redis process: systemctl status redis
  4. Check Redis logs: journalctl -u redis -n 100

Resolution:

  1. If process crashed: systemctl restart redis
  2. If instance failed: Terminate instance, Auto Scaling Group will replace
  3. If cluster split-brain: Follow Redis Cluster recovery procedure (link to detailed runbook)

Prevention:

  • Monitor Redis memory usage (alert before OOM)
  • Enable Redis persistence (RDB + AOF)
  • Ensure Auto Scaling Group health checks configured

Appendix C: Observability Validation Checklist

Metrics:

  • All 2000 instances scraped by Prometheus
  • No scrape errors in last 24 hours
  • Prometheus storage utilization <80%
  • Grafana dashboards loading <500ms
  • Alert rules validated (test firing)

Tracing:

  • Jaeger receiving spans from all proxy nodes
  • Trace sampling rate = 1% (measured)
  • Cassandra storage utilization <60%
  • End-to-end traces visible in Grafana Explore
  • Trace-to-log correlation working

Logging:

  • Loki receiving logs from all nodes
  • Logs searchable in Grafana Explore
  • Log volume <10 GB/day
  • S3 storage costs <$1/month
  • No PII in logs (verified with sample queries)

Alerting:

  • PagerDuty integration tested (test alert sent)
  • Slack notifications working
  • Alert grouping configured correctly
  • Runbooks linked from all alerts
  • On-call rotation configured