RFC-050: Operations Dashboard (HUD)

Status: Draft | Author: Platform Team | Created: 2025-11-07 | Updated: 2025-11-07

Abstract

This RFC defines the Prism Operations Dashboard (internally called "HUD", for Heads-Up Display): a real-time web interface that provides comprehensive visibility into Prism proxy health, pattern status, backend connectivity, and system performance.

Core Principle: Detect issues before users notice through proactive monitoring of high-probability failure indicators.

The dashboard provides:

  1. Real-time System Health: Overall status with drill-down to component details
  2. Performance Monitoring: Latency, throughput, and SLO compliance tracking
  3. Pattern & Backend Health: Lifecycle state and connection pool monitoring
  4. Messaging Visibility: PubSub message flow and delivery health
  5. Multi-Tenancy Insights: Per-namespace resource usage and error rates
  6. Security Monitoring: Authentication success rates and authorization metrics
  7. Observability Health: OpenTelemetry pipeline status
  8. Resource Utilization: System-level CPU, memory, and capacity tracking

Technology Stack: Go HTTP server (backend), HTMX + D3.js (frontend), WebSocket for real-time updates, Prometheus metrics + SigNoz traces. Framework-less approach with no build step required (see ADR-061).

Motivation

Problem Statement

Current Situation (Post-POC 1-3):

  • ✅ Proxy spawns and manages patterns successfully
  • ✅ Patterns communicate with backends (MemStore, Redis, NATS)
  • ✅ OpenTelemetry traces sent to SigNoz
  • ❌ No unified operational view: Developers must check:
    • SigNoz UI for traces
    • Docker logs for pattern health
    • curl requests for metrics
    • Manual gRPC health checks
    • Backend-specific tools (redis-cli, nats-cli)

Pain Points:

  1. Reactive Debugging: Issues discovered after failures occur
  2. Scattered Information: No single pane of glass for system health
  3. Slow MTTR: Mean Time To Resolution is high because most of it is spent gathering information
  4. No SLO Visibility: No way to tell whether latency/reliability targets are being met
  5. Pattern Health Blind Spots: Process crashes/restarts go unnoticed
  6. Backend Issues Surface Late: Connection pool exhaustion detected only after timeouts

Goals

  1. Proactive Issue Detection: Catch problems before user impact (target: 95% issues detected within 30s)
  2. Single Pane of Glass: All critical system health in one view
  3. Fast MTTR: Reduce time to identify root cause from minutes to seconds
  4. SLO Tracking: Visualize compliance with latency/reliability targets
  5. Developer Productivity: Eliminate manual metric gathering during debugging
  6. Production Readiness: Dashboard suitable for both local dev and production

Non-Goals

  • Application Metrics: Dashboard focuses on Prism infrastructure, not application-level business metrics
  • Log Aggregation: Logs remain in SigNoz; dashboard shows summary insights only
  • Alerting Engine: Dashboard surfaces metrics but doesn't replace alert manager (use SigNoz alerts)
  • Historical Analysis: Focus on real-time (last 24h), not long-term trends (use SigNoz for that)
  • Multi-Cluster Management: Single cluster/proxy instance view (multi-cluster is future work)

Architecture Overview

System Context

Data Flow

  1. Proxy & Patterns: Emit metrics (Prometheus format) + traces (OTLP)
  2. Dashboard Backend: Aggregates data from multiple sources every 5 seconds
  3. WebSocket Push: Real-time updates to browser clients
  4. Frontend Rendering: HTMX + D3.js display live data with <2s latency

Technology Stack

See ADR-061: Framework-Less Web UI for complete rationale.

Backend (dashboard/):

  • Language: Go
  • HTTP Server: net/http standard library + gorilla/mux
  • Templates: Go html/template (server-side rendering)
  • Metrics Scraping: Native Prometheus text format parser
  • gRPC Client: Native Go gRPC client (call pattern health checks)
  • SigNoz Integration: net/http client (query SigNoz API)
  • WebSocket: gorilla/websocket (push updates to frontend)
  • Caching: Optional in-memory cache (reduce SigNoz query load)
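A minimal sketch of what the "native Prometheus text format parser" could look like. This is illustrative only: the real collector would also need to handle timestamps, label escaping, and histogram/summary families.

```go
package main

import (
    "bufio"
    "fmt"
    "strconv"
    "strings"
)

// parseMetrics returns series -> value for each sample line in
// Prometheus text exposition format, skipping `# HELP` / `# TYPE`
// comments and blank lines.
func parseMetrics(text string) map[string]float64 {
    out := make(map[string]float64)
    sc := bufio.NewScanner(strings.NewReader(text))
    for sc.Scan() {
        line := strings.TrimSpace(sc.Text())
        if line == "" || strings.HasPrefix(line, "#") {
            continue
        }
        // The value is everything after the last space.
        i := strings.LastIndex(line, " ")
        if i < 0 {
            continue
        }
        v, err := strconv.ParseFloat(line[i+1:], 64)
        if err != nil {
            continue
        }
        out[line[:i]] = v
    }
    return out
}

func main() {
    m := parseMetrics("# TYPE prism_requests_total counter\nprism_requests_total 8432\n")
    fmt.Println(m["prism_requests_total"]) // 8432
}
```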

Frontend (embedded in dashboard/static/ and dashboard/templates/):

  • HTML: Go templates (server-rendered, type-safe)
  • Interactivity: HTMX 1.9+ (14KB, replaces React)
  • Visualization: D3.js v7 (~70KB, best-in-class charting)
  • Diagrams: Mermaid.js (~200KB, text-to-diagram)
  • Styling: Plain CSS or Tailwind (optional)
  • WebSocket Client: Native browser WebSocket API
  • Build Step: NONE - no npm, no webpack, instant reload

Deployment:

  • Binary: Single Go executable (go build)
  • Assets: Embedded with //go:embed directive (templates + static files)
  • Development: make dashboard-run starts server (instant reload with air)
  • Production: Single binary or Docker container (no Node.js runtime needed)

Dashboard Views

View Hierarchy

The dashboard uses a hub-and-spoke model with a primary System Health view and drill-down panels:

┌─────────────────────────────────────────────────────────┐
│ System Health (Hub)                                     │ ← Always visible
│ Overall status, critical metrics, pattern grid          │
├─────────────────────────────────────────────────────────┤
│ Drill-Down Panels (Spoke)                               │ ← Accessed via tabs/clicks
│ ┌──────────────┬──────────────┬──────────────┐          │
│ │ Performance  │ Backend      │ Messaging    │          │
│ │ Monitoring   │ Health       │ Flow         │          │
│ ├──────────────┼──────────────┼──────────────┤          │
│ │ Multi-       │ Security     │ Observability│          │
│ │ Tenancy      │ Monitoring   │ Health       │          │
│ ├──────────────┼──────────────┼──────────────┤          │
│ │ System       │ Alerts       │ Settings     │          │
│ │ Resources    │ History      │              │          │
│ └──────────────┴──────────────┴──────────────┘          │
└─────────────────────────────────────────────────────────┘

View 1: System Health (Primary View)

Priority: CRITICAL (Always visible) | Detection Probability: 99% | Update Frequency: 2 seconds

Purpose: Single-glance answer to "Is Prism working?"

Layout

┌────────────────────────────────────────────────────────────────┐
│ Prism Operations Dashboard [Auto-refresh: 2s]│
├────────────────────────────────────────────────────────────────┤
│ │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ 🟢 SYSTEM HEALTHY ││
│ │ ││
│ │ ✅ 99.97% Success Rate 📊 8,432 RPS ⚡ 0.8ms P99 ││
│ │ 🔒 99.8% Auth Success 🔗 3 Patterns 📦 15 Namespaces ││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ Pattern Health Grid │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Pattern Status Phase Uptime Restarts │ │
│ ├──────────────────────────────────────────────────────────┤ │
│ │ MemStore 🟢 HEALTHY Running 4d 3h 0 │ │
│ │ Redis 🟢 HEALTHY Running 4d 3h 0 │ │
│ │ NATS 🟡 DEGRADED Running 2h 15m 2 │◄─┐
│ │ PostgreSQL 🟢 HEALTHY Running 4d 3h 0 │ │
│ └──────────────────────────────────────────────────────────┘ │
│ [Click row for details] ──┘
│ │
├────────────────────────────────────────────────────────────────┤
│ Critical Metrics (Last 5 minutes) │
│ ┌────────────┬────────────┬────────────┬────────────────────┐ │
│ │ Latency │ Throughput │ Error Rate │ Backend Pools │ │
│ ├────────────┼────────────┼────────────┼────────────────────┤ │
│ │ P50: 0.4ms │ Read: 5.2k│ 0.03% │ Redis: 7/10 🟢 │ │
│ │ P99: 0.8ms │ Write: 3.2k│ 25 errors │ NATS: Connected 🟡 │ │
│ │ P999: 2.1ms│ Total: 8.4k│ /min │ PG: 3/20 🟢 │ │
│ │ │ │ │ │ │
│ │ [Chart] │ [Chart] │ [Chart] │ [Status Grid] │ │
│ └────────────┴────────────┴────────────┴────────────────────┘ │
│ │
├────────────────────────────────────────────────────────────────┤
│ Recent Alerts │
│ 🟡 2m ago: NATS pattern restarted (restart loop detected) │
│ 🟢 15m ago: Redis pool capacity >90% (now resolved) │
│ [View All Alerts →] │
└────────────────────────────────────────────────────────────────┘

Components

1. System Status Banner

  • Overall Health: Computed from all patterns (GREEN if all healthy, YELLOW if any degraded, RED if any unhealthy)
  • Success Rate: (successful_requests / total_requests) * 100 (last 5 min)
  • RPS: Current requests per second (read + write)
  • P99 Latency: 99th percentile latency (last 5 min)
  • Auth Success: Authentication success rate
  • Active Patterns: Count of patterns in HEALTHY or DEGRADED state
  • Namespaces: Total active namespaces
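The rollup rule above (GREEN only if every pattern is healthy, YELLOW if any is degraded, RED if any is unhealthy) is simple enough to sketch directly. Type and function names are illustrative:

```go
package main

import "fmt"

// PatternHealth is a simplified stand-in for the gRPC HealthCheck result.
type PatternHealth struct {
    Name   string
    Status string // "HEALTHY", "DEGRADED", or "UNHEALTHY"
}

// computeOverallHealth rolls per-pattern status up into one banner color:
// any UNHEALTHY pattern wins immediately (RED); otherwise any DEGRADED
// pattern downgrades the banner to YELLOW.
func computeOverallHealth(patterns []PatternHealth) string {
    overall := "HEALTHY"
    for _, p := range patterns {
        switch p.Status {
        case "UNHEALTHY":
            return "UNHEALTHY"
        case "DEGRADED":
            overall = "DEGRADED"
        }
    }
    return overall
}

func main() {
    fmt.Println(computeOverallHealth([]PatternHealth{
        {"MemStore", "HEALTHY"}, {"NATS", "DEGRADED"},
    })) // DEGRADED
}
```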

2. Pattern Health Grid

  • Status: 🟢 HEALTHY / 🟡 DEGRADED / 🔴 UNHEALTHY (from gRPC HealthCheck)
  • Phase: spawn → connect → initialize → start → running
  • Uptime: Time since last start
  • Restarts: Restart count in last 24h
  • Click Action: Drill down to pattern detail view

3. Critical Metrics Charts

  • Latency Chart: Line chart showing P50/P99/P999 over last 5 minutes
  • Throughput Chart: Stacked area chart (read vs write RPS)
  • Error Rate Chart: Bar chart of errors per minute
  • Backend Pools: Connection pool status for each backend

4. Recent Alerts

  • Last 5 alerts/warnings with timestamp and auto-resolution status
  • Color-coded by severity: 🔴 Critical / 🟡 Warning / 🟢 Resolved

Data Sources

// dashboard/handlers/system_health.go

func SystemHealthHandler(w http.ResponseWriter, r *http.Request) {
    // Aggregate system health from all data sources

    // 1. Scrape proxy metrics (Prometheus)
    proxyMetrics, err := scrapePrometheus("http://localhost:8980/metrics")
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }

    // 2. Query pattern health checks (gRPC)
    patternHealth, err := queryAllPatternsHealth(context.Background())
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }

    // 3. Query SigNoz for latency percentiles
    latency, err := querySigNozLatency("5m")
    if err != nil {
        log.Printf("SigNoz query error: %v", err)
        // Continue with empty latency data
    }

    // 4. Query admin API for namespace count
    namespaces, err := adminClient.ListNamespaces(context.Background())
    if err != nil {
        log.Printf("Admin API error: %v", err)
    }

    data := SystemHealthData{
        OverallStatus:   computeOverallHealth(patternHealth),
        SuccessRate:     proxyMetrics.SuccessRate,
        RPS:             proxyMetrics.RequestsPerSecond,
        LatencyP99:      latency.P99,
        AuthSuccessRate: proxyMetrics.AuthSuccessRate,
        Patterns:        patternHealth,
        NamespaceCount:  len(namespaces),
        Timestamp:       time.Now(),
    }

    // Render Go template
    templates.ExecuteTemplate(w, "dashboard.html", data)
}

WebSocket Push

Client-side (in Go template):

<!-- dashboard/templates/dashboard.html -->
<script>
  // Native WebSocket API (no React hooks needed)
  const ws = new WebSocket('ws://' + location.host + '/ws/system-health');

  ws.onmessage = (event) => {
    const data = JSON.parse(event.data);

    // Update status banner
    document.getElementById('success-rate').textContent = data.success_rate.toFixed(2) + '%';
    document.getElementById('rps').textContent = data.rps;
    document.getElementById('latency-p99').textContent = data.latency_p99.toFixed(1) + 'ms';

    // Update charts with D3.js
    updateLatencyChart(data.latency_history);
    updateThroughputChart(data.throughput_history);

    // Update last refresh time
    document.getElementById('last-update').textContent = new Date().toLocaleTimeString();
  };

  ws.onerror = (error) => {
    console.error('WebSocket error:', error);
    document.getElementById('status-indicator').className = 'status-error';
  };
</script>

View 2: Performance Monitoring

Priority: HIGH | Detection Probability: 95% | Update Frequency: 5 seconds

Purpose: Detailed latency, throughput, and SLO compliance tracking

Layout

┌────────────────────────────────────────────────────────────────┐
│ Performance Monitoring │
├────────────────────────────────────────────────────────────────┤
│ │
│ SLO Compliance │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Target: 99.9% requests < 10ms (P99) ││
│ │ Current: 99.7% ✅ ││
│ │ [━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━] 99.7% ││
│ │ ││
│ │ Last 24h: 99.8% ✅ | Last 7d: 99.9% ✅ ││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ Latency Breakdown (Last 1 hour) │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Operation P50 P99 P999 Max Count ││
│ ├────────────────────────────────────────────────────────────┤│
│ │ KeyValue.Set 0.3ms 0.7ms 1.2ms 4.5ms 125k ││
│ │ KeyValue.Get 0.2ms 0.5ms 0.9ms 3.1ms 287k ││
│ │ PubSub.Pub 0.4ms 0.9ms 2.1ms 8.7ms 42k ││
│ │ PubSub.Sub 0.8ms 2.3ms 5.4ms 15.2ms 18k ││
│ └────────────────────────────────────────────────────────────┘│
│ │
│ Latency Heatmap (Time vs Percentile) │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ 10ms ┤ ▂▃ ││
│ │ 5ms ┤ ▁▂▃▄▅▆▇███ ││
│ │ 1ms ┤ ▁▂▃▄▅▆▇███ ││
│ │ 0.5ms┤ ▁▂▃▄▅▆▇████████ ││
│ │ 0.1ms┤▁▂▃▄▅▆▇█████ ││
│ │ └────────────────────────────────────────────────────┘││
│ │ 12:00 12:15 12:30 12:45 13:00 13:15 13:30 ││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ Throughput by Pattern │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ [Stacked Area Chart] ││
│ │ 10k ┤ ███ ││
│ │ 8k ┤ ▄▄▄▄▄███ ││
│ │ 6k ┤ ▃▃▃▃▃█████████ ││
│ │ 4k ┤ ▂▂▂▂▂▂██████████████ ││
│ │ 2k ┤ ▁▁▁▁▁▁▁▁▁▁████████████████████ ││
│ │ └────────────────────────────────────────────────────┘││
│ │ MemStore ▀▀▀ Redis ▀▀▀ NATS ▀▀▀ PostgreSQL ▀▀▀ ││
│ └────────────────────────────────────────────────────────────┘│
│ │
└────────────────────────────────────────────────────────────────┘

Key Features

  1. SLO Compliance Tracker: Visual progress bar showing % of requests meeting latency target
  2. Operation-Level Latency: Breakdown by operation type (Set, Get, Publish, Subscribe)
  3. Latency Heatmap: Visualize latency distribution over time (identify spikes)
  4. Throughput by Pattern: See which patterns are handling most traffic

Data Sources

  • Latency Percentiles: Query SigNoz traces aggregated by operation
  • SLO Compliance: Count requests with latency < 10ms / total requests
  • Throughput: Scrape Prometheus metrics from proxy (prism_requests_total counter)
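The SLO-compliance formula above (requests under the latency target divided by total requests) reduces to a small helper. This is a sketch over raw latency samples; the real aggregator would work on SigNoz query results instead:

```go
package main

import "fmt"

// sloCompliance returns the percentage of latency samples (in ms) at or
// below targetMs. An empty window reports 100 so a quiet system doesn't
// look out of compliance.
func sloCompliance(latenciesMs []float64, targetMs float64) float64 {
    if len(latenciesMs) == 0 {
        return 100
    }
    ok := 0
    for _, l := range latenciesMs {
        if l <= targetMs {
            ok++
        }
    }
    return float64(ok) / float64(len(latenciesMs)) * 100
}

func main() {
    // Three of four samples are under the 10ms target.
    fmt.Println(sloCompliance([]float64{0.4, 0.8, 12.0, 2.1}, 10)) // 75
}
```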

View 3: Backend Health

Priority: HIGH | Detection Probability: 92% | Update Frequency: 5 seconds

Purpose: Connection pool health and backend connectivity status

Layout

┌────────────────────────────────────────────────────────────────┐
│ Backend Health │
├────────────────────────────────────────────────────────────────┤
│ │
│ Connection Pools │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Backend Type Active Idle Max Util Status ││
│ ├────────────────────────────────────────────────────────────┤│
│ │ Redis KeyValue 7 3 10 70% 🟢 HEALTHY ││
│ │ PubSub 2 8 10 20% 🟢 HEALTHY ││
│ │ NATS PubSub 1 0 1 100% 🟡 DEGRADED ││
│ │ PostgreSQL KeyValue 3 17 20 15% 🟢 HEALTHY ││
│ │ Queue 5 15 20 25% 🟢 HEALTHY ││
│ │ MemStore KeyValue N/A N/A N/A N/A 🟢 HEALTHY ││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ Connection Metrics (Last 5 minutes) │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Redis ││
│ │ ├─ Connections: [Time series chart] ││
│ │ ├─ Acquisition Time: Avg 2.1ms P99 8.3ms ││
│ │ ├─ Errors: 0 refused, 0 timeout, 0 reset ││
│ │ └─ Pool Capacity: [Progress bar] 70% ││
│ │ ││
│ │ NATS ││
│ │ ├─ Connection State: CONNECTED (reconnects: 2) ││
│ │ ├─ Subscriptions: 127 active ││
│ │ ├─ Stats: In: 42k msgs (8.4 MB) Out: 18k msgs (1.2 MB) ││
│ │ └─ Pending: 0 messages ││
│ │ ││
│ │ PostgreSQL ││
│ │ ├─ Connections: 8/20 active ││
│ │ ├─ Active Queries: 3 ││
│ │ ├─ Query Duration: Avg 12ms P99 45ms ││
│ │ └─ Pool Wait Time: Avg 0.3ms ││
│ └────────────────────────────────────────────────────────────┘│
│ │
└────────────────────────────────────────────────────────────────┘

Key Features

  1. Connection Pool Table: Shows active/idle/max connections per backend
  2. Utilization Tracking: Visual indicator when approaching capacity (>90% = yellow)
  3. Connection Acquisition Time: How long to get a connection from pool
  4. Error Tracking: Connection refused, timeout, reset counts
  5. Backend-Specific Metrics:
    • Redis: Pool stats from PoolStats()
    • NATS: Connection state, subscription count, message stats
    • PostgreSQL: Active queries, query duration

Data Sources

// patterns/redis/plugin.go

func (r *RedisPlugin) HealthCheck(ctx context.Context) *HealthCheckResponse {
    stats := r.client.PoolStats()

    return &HealthCheckResponse{
        Status: computeStatus(stats),
        Metadata: map[string]string{
            "connections_active": fmt.Sprintf("%d", stats.TotalConns-stats.IdleConns),
            "connections_idle":   fmt.Sprintf("%d", stats.IdleConns),
            "pool_size":          fmt.Sprintf("%d", r.config.PoolSize),
            "utilization_pct": fmt.Sprintf("%.1f",
                float64(stats.TotalConns-stats.IdleConns)/float64(r.config.PoolSize)*100),
        },
    }
}
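The computeStatus helper the Redis HealthCheck relies on is not shown; a plausible sketch, using the >90% utilization warning threshold from the key features above (thresholds and signature are assumptions, not the actual implementation):

```go
package main

import "fmt"

// computeStatus maps pool utilization to a health color: >90% utilization
// goes yellow (DEGRADED) so capacity problems surface before timeouts,
// matching the ">90% = yellow" rule in the Backend Health view.
func computeStatus(active, poolSize int) string {
    if poolSize <= 0 {
        return "UNHEALTHY" // misconfigured pool
    }
    if float64(active)/float64(poolSize) > 0.9 {
        return "DEGRADED"
    }
    return "HEALTHY"
}

func main() {
    fmt.Println(computeStatus(7, 10)) // HEALTHY (70% utilization)
    fmt.Println(computeStatus(1, 1))  // DEGRADED (100%, like the NATS row)
}
```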

View 4: Messaging Flow (PubSub)

Priority: MEDIUM-HIGH | Detection Probability: 88% | Update Frequency: 5 seconds

Purpose: PubSub message delivery health and subscriber tracking

Layout

┌────────────────────────────────────────────────────────────────┐
│ Messaging Flow (PubSub) │
├────────────────────────────────────────────────────────────────┤
│ │
│ Message Flow Health │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Published: 42,187 msgs | Delivered: 126,561 msgs ✅ ││
│ │ Dropped: 12 msgs (0.03%) | Pending: 0 msgs ││
│ │ ││
│ │ Delivery Ratio: 3.0x (fanout working correctly) ││
│ │ Delivery Latency: P99 2.3ms ✅ ││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ Active Topics & Subscribers │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Topic Subs Pub/sec Del/sec Latency Status ││
│ ├────────────────────────────────────────────────────────────┤│
│ │ events.user 3 142 426 1.2ms 🟢 ││
│ │ events.system 5 87 435 0.9ms 🟢 ││
│ │ logs.application 1 523 523 0.4ms 🟢 ││
│ │ alerts.critical 12 2 24 15.2ms 🟡 ││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ Message Timeline (Last 5 minutes) │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Published ▀▀▀ ││
│ │ Delivered ▀▀▀ ││
│ │ Dropped ▀▀▀ ││
│ │ ││
│ │ [Multi-line chart showing published vs delivered vs dropped]││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ Subscriber Details │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Subscriber ID Topic Msgs Recv Lag ││
│ ├────────────────────────────────────────────────────────────┤│
│ │ sub-worker-1 events.user 14,235 0ms ││
│ │ sub-worker-2 events.user 14,190 0ms ││
│ │ sub-worker-3 events.user 14,201 0ms ││
│ │ sub-analytics events.system 8,745 0ms ││
│ │ sub-logger logs.application 52,301 125ms 🟡 ││
│ └────────────────────────────────────────────────────────────┘│
│ │
└────────────────────────────────────────────────────────────────┘

Key Features

  1. Message Flow Summary: Published vs Delivered (Delivered should be roughly N× Published for a topic with N subscribers)
  2. Dropped Messages: Count and percentage (should be near zero)
  3. Delivery Latency: Time from publish to subscriber receive
  4. Topic Breakdown: Per-topic subscriber count and throughput
  5. Subscriber Lag: Identify slow subscribers (lag >100ms = warning)
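The two health signals above, the delivery ratio and the >100ms lag warning, are plain arithmetic. A sketch with illustrative names:

```go
package main

import "fmt"

// deliveryRatio is delivered/published; for a topic with N healthy
// subscribers it should sit near N (the fanout factor).
func deliveryRatio(published, delivered int) float64 {
    if published == 0 {
        return 0
    }
    return float64(delivered) / float64(published)
}

// lagStatus flags a subscriber whose end-to-end lag exceeds 100ms,
// the warning threshold used in the Subscriber Details table.
func lagStatus(lagMs float64) string {
    if lagMs > 100 {
        return "WARNING"
    }
    return "OK"
}

func main() {
    fmt.Println(deliveryRatio(42187, 126561)) // 3 (three-way fanout)
    fmt.Println(lagStatus(125))               // WARNING
}
```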

Data Sources

// patterns/nats/plugin.go

func (n *NATSPlugin) HealthCheck(ctx context.Context) *HealthCheckResponse {
    stats := n.conn.Stats()

    return &HealthCheckResponse{
        Status: HEALTHY,
        Metadata: map[string]string{
            "subscription_count": fmt.Sprintf("%d", len(n.subscriptions)),
            "in_msgs":            fmt.Sprintf("%d", stats.InMsgs),
            "out_msgs":           fmt.Sprintf("%d", stats.OutMsgs),
            "in_bytes":           fmt.Sprintf("%d", stats.InBytes),
            "out_bytes":          fmt.Sprintf("%d", stats.OutBytes),
            "dropped_msgs":       fmt.Sprintf("%d", n.droppedMessageCount),
        },
    }
}

View 5: Multi-Tenancy (Namespaces)

Priority: MEDIUM | Detection Probability: 85% | Update Frequency: 10 seconds

Purpose: Per-namespace resource usage and error tracking

Layout

┌────────────────────────────────────────────────────────────────┐
│ Multi-Tenancy (Namespaces) │
├────────────────────────────────────────────────────────────────┤
│ │
│ Active Namespaces (15 total) │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Namespace RPS Latency Errors Patterns Status ││
│ ├────────────────────────────────────────────────────────────┤│
│ │ user-platform 4,231 0.8ms 0.02% KV, PS 🟢 ││
│ │ payments 1,847 1.2ms 0.01% KV, Q 🟢 ││
│ │ analytics 892 2.3ms 0.05% PS, TS 🟢 ││
│ │ notifications 645 0.9ms 1.2% PS 🟡 ││
│ │ search-index 387 5.4ms 0.03% KV, G 🟢 ││
│ │ ... (10 more) ... ... ... ... ... ││
│ └────────────────────────────────────────────────────────────┘│
│ [Show All →] │
│ │
├────────────────────────────────────────────────────────────────┤
│ Traffic Distribution (Top 10 by RPS) │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ [Pie Chart] ││
│ │ ││
│ │ user-platform: 50.2% ││
│ │ payments: 21.9% ││
│ │ analytics: 10.6% ││
│ │ notifications: 7.6% ││
│ │ others: 9.7% ││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ Namespace Details: user-platform │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Configuration ││
│ │ ├─ Patterns: KeyValue (Redis), PubSub (NATS) ││
│ │ ├─ Created: 2025-10-15 ││
│ │ ├─ Owner: team-user-platform@example.com ││
│ │ └─ Max RPS: 10,000 (current: 42%) ││
│ │ ││
│ │ Performance (Last 1 hour) ││
│ │ ├─ P99 Latency: [Chart] ││
│ │ ├─ Throughput: [Chart] ││
│ │ └─ Error Rate: [Chart] ││
│ │ ││
│ │ Top Operations ││
│ │ ├─ KeyValue.Get: 3,201 RPS ││
│ │ ├─ KeyValue.Set: 987 RPS ││
│ │ └─ PubSub.Publish: 43 RPS ││
│ └────────────────────────────────────────────────────────────┘│
│ │
└────────────────────────────────────────────────────────────────┘

Key Features

  1. Namespace Table: RPS, latency, error rate per namespace
  2. Traffic Distribution: Identify noisy neighbors (>80% of traffic)
  3. Namespace Drill-Down: Detailed metrics for selected namespace
  4. Capacity Tracking: Show RPS vs max configured capacity
  5. Pattern Usage: Which patterns each namespace uses

Data Sources

  • Admin API: Query namespace configurations
  • SigNoz Traces: Filter by namespace tag for per-namespace metrics
  • Proxy Metrics: Scrape prism_requests_total{namespace="..."} label
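Pulling the per-namespace counters out of scraped series like `prism_requests_total{namespace="payments"} 1847` can be sketched as below. Parsing is deliberately naive (single label, no escaping); the real collector would use a proper exposition-format parser:

```go
package main

import (
    "fmt"
    "strconv"
    "strings"
)

// namespaceRequests maps namespace -> request count for every
// prism_requests_total series carrying a namespace label.
func namespaceRequests(lines []string) map[string]float64 {
    out := make(map[string]float64)
    const prefix = `prism_requests_total{namespace="`
    for _, line := range lines {
        if !strings.HasPrefix(line, prefix) {
            continue
        }
        rest := line[len(prefix):]
        end := strings.Index(rest, `"`)
        if end < 0 {
            continue
        }
        ns := rest[:end]
        // The sample value is the last whitespace-separated field.
        fields := strings.Fields(rest[end:])
        if len(fields) < 2 {
            continue
        }
        v, err := strconv.ParseFloat(fields[len(fields)-1], 64)
        if err != nil {
            continue
        }
        out[ns] += v
    }
    return out
}

func main() {
    m := namespaceRequests([]string{
        `prism_requests_total{namespace="payments"} 1847`,
        `prism_requests_total{namespace="analytics"} 892`,
    })
    fmt.Println(m["payments"]) // 1847
}
```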

View 6: Security Monitoring

Priority: MEDIUM | Detection Probability: 80% | Update Frequency: 10 seconds

Purpose: Authentication and authorization tracking

Layout

┌────────────────────────────────────────────────────────────────┐
│ Security Monitoring │
├────────────────────────────────────────────────────────────────┤
│ │
│ Authentication Health │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ JWT Validation Success: 99.8% ✅ ││
│ │ Token Refreshes: 127 (last 1h) ││
│ │ Failed Attempts: 8 (last 1h) ││
│ │ Dex Connectivity: 🟢 CONNECTED ││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ Failed Authentication Attempts (Last 24 hours) │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Time Reason Source IP User ││
│ ├────────────────────────────────────────────────────────────┤│
│ │ 13:24:15 Token expired 10.0.1.45 alice@ ││
│ │ 13:18:42 Invalid signature 10.0.2.12 unknown ││
│ │ 12:45:33 Token expired 10.0.1.45 alice@ ││
│ │ 11:32:18 Missing token 10.0.3.88 unknown ││
│ │ 11:15:07 Invalid issuer 10.0.2.99 unknown ││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ Authorization (Last 1 hour) │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Total Requests: 508,432 ││
│ │ Authorized: 508,401 (99.99%) ││
│ │ Denied: 31 (0.01%) ││
│ │ ││
│ │ Denial Reasons: ││
│ │ ├─ Namespace access denied: 18 ││
│ │ ├─ Pattern not allowed: 8 ││
│ │ ├─ Rate limit exceeded: 5 ││
│ │ └─ Invalid operation: 0 ││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ JWKS Cache │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Cache Hits: 508,401 (99.99%) ││
│ │ Cache Misses: 31 (0.01%) ││
│ │ Last Refresh: 13:15:42 (15 minutes ago) ││
│ │ Next Refresh: 13:45:42 (in 15 minutes) ││
│ └────────────────────────────────────────────────────────────┘│
│ │
└────────────────────────────────────────────────────────────────┘

Key Features

  1. Auth Success Rate: Should be >99% (excludes expected token expirations)
  2. Failed Attempts Log: Investigate suspicious patterns (same IP, repeated failures)
  3. Authorization Denials: Track why requests are denied
  4. JWKS Cache Health: Ensure public key cache is working
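One way to read the "excludes expected token expirations" caveat above is as arithmetic: expirations are routine refresh churn, so the headline rate should count only unexpected failures. A hypothetical helper (the real metric pipeline may slice this differently):

```go
package main

import "fmt"

// authSuccessRate treats token expirations as expected and removes them
// from both the failure count and the denominator, so only unexpected
// failures (bad signature, bad issuer, missing token) depress the rate.
func authSuccessRate(total, failed, expired int) float64 {
    denom := total - expired
    if denom <= 0 {
        return 100
    }
    unexpected := failed - expired
    if unexpected < 0 {
        unexpected = 0
    }
    return float64(denom-unexpected) / float64(denom) * 100
}

func main() {
    // 10,000 attempts, 8 failures, 5 of them plain expirations.
    fmt.Printf("%.2f%%\n", authSuccessRate(10000, 8, 5)) // 99.97%
}
```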

View 7: Observability Health

Priority: MEDIUM-LOW | Detection Probability: 75% | Update Frequency: 30 seconds

Purpose: Verify OpenTelemetry pipeline is working

Layout

┌────────────────────────────────────────────────────────────────┐
│ Observability Health │
├────────────────────────────────────────────────────────────────┤
│ │
│ OpenTelemetry Pipeline Status │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Component Status Last Exported ││
│ ├────────────────────────────────────────────────────────────┤│
│ │ Prism Proxy → SigNoz 🟢 HEALTHY 2 seconds ago ││
│ │ Patterns → SigNoz 🟢 HEALTHY 3 seconds ago ││
│ │ OTLP Collector 🟢 HEALTHY 1 second ago ││
│ │ SigNoz (Query Service) 🟢 HEALTHY 5 seconds ago ││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ Export Metrics (Last 5 minutes) │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Traces Exported: 12,487 ││
│ │ Metrics Exported: 52,301 ││
│ │ Logs Exported: 8,945 ││
│ │ ││
│ │ Export Errors: 3 (0.02%) ││
│ │ Export Latency: P99 12ms ││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ Trace Coverage │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Traces with All Spans: 12,401 (99.3%) ││
│ │ Traces Missing Spans: 86 (0.7%) ││
│ │ ││
│ │ Missing Spans Breakdown: ││
│ │ ├─ Backend span missing: 45 ││
│ │ ├─ Pattern span missing: 31 ││
│ │ └─ Proxy span missing: 10 ││
│ └────────────────────────────────────────────────────────────┘│
│ │
└────────────────────────────────────────────────────────────────┘

Key Features

  1. Pipeline Status: Verify all components exporting telemetry
  2. Export Metrics: Count of traces/metrics/logs exported
  3. Trace Coverage: Identify missing spans (should have proxy + pattern + backend)
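The trace-coverage check above (a trace is complete when it carries proxy, pattern, and backend spans) can be sketched as a set-membership test. The span-tier naming is an assumption:

```go
package main

import "fmt"

// traceComplete reports whether a trace's span set covers all three tiers.
func traceComplete(spans []string) bool {
    seen := map[string]bool{}
    for _, s := range spans {
        seen[s] = true
    }
    return seen["proxy"] && seen["pattern"] && seen["backend"]
}

// coveragePct returns the percentage of traces with all spans present,
// the number the Trace Coverage panel displays.
func coveragePct(traces [][]string) float64 {
    if len(traces) == 0 {
        return 100
    }
    ok := 0
    for _, t := range traces {
        if traceComplete(t) {
            ok++
        }
    }
    return float64(ok) / float64(len(traces)) * 100
}

func main() {
    traces := [][]string{
        {"proxy", "pattern", "backend"},
        {"proxy", "pattern"}, // backend span missing
    }
    fmt.Println(coveragePct(traces)) // 50
}
```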

View 8: System Resources

Priority: LOW | Detection Probability: 70% | Update Frequency: 10 seconds

Purpose: CPU, memory, and capacity tracking

Layout

┌────────────────────────────────────────────────────────────────┐
│ System Resources │
├────────────────────────────────────────────────────────────────┤
│ │
│ Prism Proxy │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Memory: 287 MB / 500 MB (57%) [████████▌ ] ││
│ │ CPU: 12% (0.48 cores) ││
│ │ Threads: 24 ││
│ │ File Descriptors: 156 / 1024 (15%) ││
│ │ Uptime: 4d 3h 24m ││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ Pattern Processes │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Pattern Memory CPU Goroutines Uptime ││
│ ├────────────────────────────────────────────────────────────┤│
│ │ MemStore 12 MB 2% 8 4d 3h ││
│ │ Redis 45 MB 5% 12 4d 3h ││
│ │ NATS 38 MB 8% 15 2h 15m ││
│ │ PostgreSQL 67 MB 3% 20 4d 3h ││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ System Totals │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Total Memory: 449 MB ││
│ │ Total CPU: 30% (1.2 cores) ││
│ │ Total Processes: 5 ││
│ └────────────────────────────────────────────────────────────┘│
│ │
└────────────────────────────────────────────────────────────────┘

Technical Implementation

See ADR-061: Framework-Less Web UI for complete implementation details and code examples.

This section provides a high-level overview. ADR-061 contains comprehensive Go code examples, HTMX patterns, and D3.js visualization code.

Backend Architecture

dashboard/
├── main.go                     # Go HTTP server entry point
├── config.go                   # Configuration (env vars)
├── handlers/
│   ├── dashboard.go            # View 1 (System Health)
│   ├── performance.go          # View 2
│   ├── backends.go             # View 3
│   ├── messaging.go            # View 4
│   ├── namespaces.go           # View 5
│   ├── security.go             # View 6
│   ├── observability.go        # View 7
│   ├── resources.go            # View 8
│   ├── api.go                  # JSON API for HTMX
│   └── websocket.go            # WebSocket hub
├── collectors/
│   ├── prometheus.go           # Scrape proxy metrics
│   ├── grpc_health.go          # Query pattern health checks
│   ├── signoz.go               # Query SigNoz API
│   └── admin.go                # Query Admin API
├── aggregators/
│   ├── system_health.go        # Aggregate system health
│   ├── performance.go          # Compute percentiles, SLO
│   └── messaging.go            # Compute message flow metrics
├── templates/
│   ├── dashboard.html          # Main dashboard template
│   ├── performance.html        # Performance view template
│   ├── backends.html           # Backend health template
│   └── partials/
│       ├── pattern_grid.html   # Reusable pattern grid
│       └── metrics_chart.html  # Reusable chart component
├── static/
│   ├── css/
│   │   └── dashboard.css       # Custom styles
│   ├── js/
│   │   ├── htmx.min.js         # HTMX (14KB)
│   │   ├── d3.v7.min.js        # D3.js (70KB)
│   │   ├── mermaid.min.js      # Mermaid.js (200KB)
│   │   └── dashboard.js        # Custom WebSocket + D3 logic
│   └── assets/
│       └── prism-logo.svg      # Static assets
└── embed.go                    # Embed templates/static with go:embed

Go Backend Example

See ADR-061 for complete implementation with all collectors and aggregators.

// dashboard/main.go

package main

import (
	"embed"
	"html/template"
	"log"
	"net/http"
	"time"

	"github.com/gorilla/mux"
	"github.com/gorilla/websocket"
)

//go:embed templates/* static/*
var content embed.FS

var (
	templates *template.Template
	upgrader  = websocket.Upgrader{
		ReadBufferSize:  1024,
		WriteBufferSize: 1024,
	}
)

func init() {
	templates = template.Must(template.ParseFS(content, "templates/*.html", "templates/partials/*.html"))
}

func main() {
	r := mux.NewRouter()

	// Serve static files (embedded)
	r.PathPrefix("/static/").Handler(http.FileServer(http.FS(content)))

	// Page routes (render Go templates)
	r.HandleFunc("/", dashboardHandler).Methods("GET")
	r.HandleFunc("/performance", performanceHandler).Methods("GET")
	r.HandleFunc("/backends", backendsHandler).Methods("GET")

	// API routes (JSON/HTML fragments for HTMX)
	r.HandleFunc("/api/health", apiHealthHandler).Methods("GET")
	r.HandleFunc("/api/patterns", apiPatternsHandler).Methods("GET")
	r.HandleFunc("/api/performance", apiPerformanceHandler).Methods("GET")

	// WebSocket for real-time updates
	r.HandleFunc("/ws", wsHandler)

	// Start background data collector
	go startCollector()

	log.Println("Dashboard starting on :8095")
	log.Fatal(http.ListenAndServe(":8095", r))
}

// collectSystemHealth, startCollector, and the remaining page/API handlers
// are elided here; see ADR-061 for the complete implementation.

func dashboardHandler(w http.ResponseWriter, r *http.Request) {
	data := collectSystemHealth()
	if err := templates.ExecuteTemplate(w, "dashboard.html", data); err != nil {
		log.Println("template error:", err)
	}
}

func apiHealthHandler(w http.ResponseWriter, r *http.Request) {
	// Return HTML fragment for HTMX partial replacement
	data := collectSystemHealth()
	if err := templates.ExecuteTemplate(w, "pattern_grid.html", data); err != nil {
		log.Println("template error:", err)
	}
}

func wsHandler(w http.ResponseWriter, r *http.Request) {
	conn, err := upgrader.Upgrade(w, r, nil)
	if err != nil {
		log.Println("WebSocket upgrade error:", err)
		return
	}
	defer conn.Close()

	// Push a fresh health snapshot to the client every 2 seconds.
	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop()

	for range ticker.C {
		data := collectSystemHealth()
		if err := conn.WriteJSON(data); err != nil {
			return
		}
	}
}
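The handlers above all call a shared `collectSystemHealth()` helper that is elided here. As a rough sketch of what such a collector might do, the following derives an overall status from Prometheus text-format counters scraped from the proxy. The metric names (`prism_requests_total`, `prism_request_errors_total`), the `SystemHealth` fields, and the status thresholds are illustrative assumptions, not the ADR-061 implementation.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// SystemHealth is the snapshot the handlers render. The exact fields are
// an assumption for illustration; the real struct lives in ADR-061.
type SystemHealth struct {
	Timestamp     time.Time
	OverallStatus string  // HEALTHY, DEGRADED, or UNHEALTHY
	SuccessRate   float64 // percent
}

// parseMetric extracts the first sample value for a metric name from
// Prometheus text exposition format (labels ignored for brevity).
func parseMetric(body, name string) (float64, bool) {
	for _, line := range strings.Split(body, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") || !strings.HasPrefix(line, name) {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		if v, err := strconv.ParseFloat(fields[1], 64); err == nil {
			return v, true
		}
	}
	return 0, false
}

// healthFromMetrics derives an overall status from scraped counters.
// The thresholds are placeholders, not part of the RFC.
func healthFromMetrics(body string) SystemHealth {
	total, _ := parseMetric(body, "prism_requests_total")
	errors, _ := parseMetric(body, "prism_request_errors_total")

	h := SystemHealth{Timestamp: time.Now(), SuccessRate: 100}
	if total > 0 {
		h.SuccessRate = 100 * (total - errors) / total
	}
	switch {
	case h.SuccessRate >= 99.5:
		h.OverallStatus = "HEALTHY"
	case h.SuccessRate >= 95:
		h.OverallStatus = "DEGRADED"
	default:
		h.OverallStatus = "UNHEALTHY"
	}
	return h
}

func main() {
	scrape := "# HELP prism_requests_total ...\nprism_requests_total 2000\nprism_request_errors_total 4\n"
	h := healthFromMetrics(scrape)
	fmt.Printf("%s %.2f%%\n", h.OverallStatus, h.SuccessRate) // HEALTHY 99.80%
}
```

In the real collector this body would come from an HTTP GET against `PROXY_METRICS_URL`, cached by `startCollector` so page loads never block on a scrape.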

Go Template Example (Server-Rendered HTML)

<!-- dashboard/templates/dashboard.html -->
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Prism Operations Dashboard</title>
  <link rel="stylesheet" href="/static/css/dashboard.css">
  <script src="/static/js/htmx.min.js"></script>
  <script src="/static/js/d3.v7.min.js"></script>
</head>
<body>
  <header>
    <h1>Prism Operations Dashboard</h1>
    <div class="refresh-indicator" id="last-update">Last updated: {{.Timestamp.Format "15:04:05"}}</div>
  </header>

  <main>
    <!-- System Status Banner -->
    <section class="status-banner status-{{.OverallStatus}}">
      <div class="status-icon">{{if eq .OverallStatus "HEALTHY"}}🟢{{else if eq .OverallStatus "DEGRADED"}}🟡{{else}}🔴{{end}}</div>
      <div class="status-text">SYSTEM {{.OverallStatus}}</div>
      <div class="metrics-summary">
        <span class="metric">✅ {{printf "%.2f" .SuccessRate}}% Success Rate</span>
        <span class="metric">📊 {{.RPS}} RPS</span>
        <span class="metric">⚡ {{printf "%.1f" .LatencyP99}}ms P99</span>
      </div>
    </section>

    <!-- Pattern Health Grid (Auto-refresh with HTMX) -->
    <section class="pattern-health">
      <h2>Pattern Health</h2>
      <div hx-get="/api/patterns" hx-trigger="load, every 5s" hx-swap="innerHTML">
        {{template "pattern_grid.html" .}}
      </div>
    </section>

    <!-- Critical Metrics Charts (D3.js) -->
    <section class="metrics-charts">
      <h2>Critical Metrics (Last 5 minutes)</h2>
      <div class="chart-grid">
        <div id="latency-chart" class="chart"></div>
        <div id="throughput-chart" class="chart"></div>
        <div id="error-chart" class="chart"></div>
      </div>
    </section>

    <!-- Recent Alerts -->
    <section class="alerts">
      <h2>Recent Alerts</h2>
      <ul class="alert-list" hx-get="/api/alerts" hx-trigger="load, every 10s" hx-swap="innerHTML">
        {{range .RecentAlerts}}
        <li class="alert-{{.Severity}}">
          <span class="alert-time">{{.Timestamp.Format "15:04"}}</span>
          {{.Message}}
        </li>
        {{end}}
      </ul>
    </section>
  </main>

  <script src="/static/js/dashboard.js"></script>
  <script>
    // Render D3.js charts on load
    renderLatencyChart({{.LatencyHistory}});
    renderThroughputChart({{.ThroughputHistory}});
    renderErrorChart({{.ErrorHistory}});

    // WebSocket for real-time updates
    const ws = new WebSocket('ws://' + location.host + '/ws');
    ws.onmessage = (event) => {
      const data = JSON.parse(event.data);
      updateCharts(data);
      document.getElementById('last-update').textContent = 'Last updated: ' + new Date().toLocaleTimeString();
    };
  </script>
</body>
</html>
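Note that `wsHandler` above runs one ticker, and therefore one `collectSystemHealth()` call, per connection. With many concurrent viewers, a broadcast hub that computes each snapshot once and fans it out to all clients is a common refinement. A minimal sketch of that pattern (an assumption for illustration, not part of the RFC):

```go
package main

import (
	"fmt"
	"sync"
)

// hub fans one health snapshot out to every subscribed WebSocket client,
// so the snapshot is computed once per tick instead of once per connection.
type hub struct {
	mu   sync.Mutex
	subs map[chan string]struct{}
}

func newHub() *hub {
	return &hub{subs: make(map[chan string]struct{})}
}

// subscribe registers a client; the returned channel receives snapshots.
func (h *hub) subscribe() chan string {
	ch := make(chan string, 1)
	h.mu.Lock()
	h.subs[ch] = struct{}{}
	h.mu.Unlock()
	return ch
}

// unsubscribe removes a client when its connection closes.
func (h *hub) unsubscribe(ch chan string) {
	h.mu.Lock()
	delete(h.subs, ch)
	h.mu.Unlock()
}

// broadcast delivers a snapshot to all subscribers, skipping clients whose
// buffer is full so a slow consumer never blocks the broadcast loop.
func (h *hub) broadcast(snapshot string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for ch := range h.subs {
		select {
		case ch <- snapshot:
		default: // slow client: drop this tick
		}
	}
}

func main() {
	h := newHub()
	a, b := h.subscribe(), h.subscribe()
	h.broadcast(`{"status":"HEALTHY"}`)
	fmt.Println(<-a == <-b) // both clients see the same snapshot
}
```

In this design, `wsHandler` would `subscribe()` on upgrade, forward each received snapshot with `conn.WriteJSON`, and `unsubscribe()` on disconnect, while a single ticker goroutine calls `broadcast`.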

Data Flow Sequence

Deployment

Local Development

# Start dashboard (single command, no build step)
cd dashboard
go run main.go

# OR with live reload
air

# Dashboard available at http://localhost:8095
# No build step required!

Production Binary

# Build single binary with embedded assets
cd dashboard
go build -o ../bin/prism-dashboard main.go

# Binary size: ~15MB (includes templates + static assets)
# Run anywhere (no dependencies)
./bin/prism-dashboard

Docker Compose (Optional)

# docker-compose.dashboard.yml

version: '3.8'

services:
  dashboard:
    build:
      context: ./dashboard
      dockerfile: Dockerfile
    container_name: prism-dashboard
    ports:
      - "8095:8095"
    environment:
      - PROXY_METRICS_URL=http://prism-proxy:8980/metrics
      - SIGNOZ_API_URL=http://signoz-query:8080
      - ADMIN_API_URL=http://prism-admin:8090
    networks:
      - prism

networks:
  prism:
    external: true

# dashboard/Dockerfile

FROM golang:1.21-alpine AS builder
WORKDIR /build
COPY . .
RUN go build -o prism-dashboard main.go

FROM alpine:latest
COPY --from=builder /build/prism-dashboard /usr/local/bin/
EXPOSE 8095
CMD ["prism-dashboard"]

Makefile Targets

# Makefile

.PHONY: dashboard-run dashboard-build dashboard-dev dashboard-up dashboard-down dashboard-assets

dashboard-run:
	@echo "Starting Prism Dashboard (Go)..."
	cd dashboard && go run main.go

dashboard-build:
	@echo "Building Dashboard binary..."
	cd dashboard && go build -o ../bin/prism-dashboard main.go
	@echo "Binary: bin/prism-dashboard (size: $(shell du -h bin/prism-dashboard | cut -f1))"

# Install air first: go install github.com/cosmtrek/air@latest
dashboard-dev:
	@echo "Starting Dashboard with live reload..."
	cd dashboard && air

dashboard-up:
	@echo "Starting Prism Operations Dashboard (Docker)..."
	docker-compose -f docker-compose.dashboard.yml up -d
	@echo "Dashboard: http://localhost:8095"

dashboard-down:
	docker-compose -f docker-compose.dashboard.yml down

dashboard-assets:
	@echo "Downloading frontend assets..."
	mkdir -p dashboard/static/js
	curl -fsSL -o dashboard/static/js/htmx.min.js https://unpkg.com/htmx.org@1.9.10/dist/htmx.min.js
	curl -fsSL -o dashboard/static/js/d3.v7.min.js https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js
	curl -fsSL -o dashboard/static/js/mermaid.min.js https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.min.js

Alternatives Considered

Alternative 1: Grafana Dashboards

Pros:

  • Industry standard, mature
  • Excellent charting library
  • Integrates with Prometheus/SigNoz
  • Alerting built-in

Cons:

  • Generic (not tailored to Prism's specific patterns)
  • Requires learning Grafana query language
  • Less real-time feel (polling-based)
  • No custom interaction (e.g., click pattern to drill down)

Rejected because: Custom dashboard provides better UX for Prism-specific workflows

Alternative 2: SigNoz-Only Approach

Pros:

  • Already using SigNoz
  • No additional service to maintain
  • Built-in trace/metric visualization

Cons:

  • Not optimized for operational health monitoring
  • No pattern-specific views
  • Can't aggregate across multiple data sources (SigNoz + Admin API + pattern health checks)
  • No custom alerts/thresholds

Rejected because: SigNoz is for debugging/analysis, not operational monitoring

Alternative 3: Prometheus + AlertManager Only

Pros:

  • Simple, battle-tested
  • Low resource footprint
  • Alert-focused

Cons:

  • No visual dashboard (alerts via notifications only)
  • Requires configuring complex PromQL queries
  • No real-time drill-down
  • Limited context (just metrics, no traces)

Rejected because: Alerts are reactive; dashboard is proactive

Alternative 4: Framework-Based (FastAPI + React)

Pros:

  • Rich React component ecosystem
  • FastAPI modern Python framework
  • TypeScript type safety
  • Popular stack, easy to hire for

Cons:

  • Build complexity: npm install (5min), webpack build (10-60s)
  • Dependency hell: 500+ npm packages
  • Slow iteration: Change → rebuild → reload (10s+)
  • Large bundles: 2-5MB JavaScript
  • Framework churn: React/Vue/Svelte versions change
  • Language mismatch: Python + JavaScript = context switching
  • Debugging complexity: Source maps, transpilation issues

Rejected because: Framework-less approach (Go + HTMX + D3.js) provides:

  • Instant reload (no build step)
  • 300KB bundle vs 2-5MB
  • Single language (Go for backend + templates)
  • Simpler deployment (single binary vs Python + Node.js)
  • Aligned with project philosophy (ADR-061)

See ADR-061: Framework-Less Web UI for complete rationale.

Success Metrics

Dashboard Effectiveness

| Metric | Target | Measurement |
|--------|--------|-------------|
| MTTR Reduction | 50% faster | Time to identify root cause |
| Issue Detection Before User Reports | 95% | Issues caught by dashboard vs user tickets |
| Developer Adoption | 80% daily usage | Unique dashboard users per day |
| Alert Noise Reduction | 30% fewer false alerts | Alert count before/after dashboard |
| Dashboard Performance | <2s end-to-end latency | Time from metric change to UI update |

Technical Metrics

| Metric | Target |
|--------|--------|
| WebSocket Uptime | >99.9% |
| API Response Time | P99 <100ms |
| Frontend Load Time | <2s initial load |
| Memory Footprint | <256MB backend, <50MB frontend |
| Data Freshness | <5s lag from source |

Implementation Phases

Phase 1: MVP (Week 1-2)

  • View 1: System Health (status banner, pattern grid)
  • Backend: Go HTTP server with WebSocket
  • Frontend: Go templates + HTMX + D3.js
  • Data collectors: Prometheus scraper, gRPC health client
  • Deliverable: Working dashboard at localhost:8095, instant reload

Phase 2: Performance & Backend Views (Week 3)

  • View 2: Performance Monitoring (latency, throughput, SLO)
  • View 3: Backend Health (connection pools)
  • Integrate SigNoz API queries
  • D3.js charts for latency/throughput visualization

Phase 3: Messaging & Multi-Tenancy (Week 4)

  • View 4: Messaging Flow (PubSub metrics)
  • View 5: Multi-Tenancy (namespace breakdown)
  • HTMX partial updates for pattern grid
  • Mermaid.js diagrams for message flow

Phase 4: Security & Observability (Week 5)

  • View 6: Security Monitoring
  • View 7: Observability Health
  • View 8: System Resources
  • Advanced D3.js visualizations (heatmaps, pie charts)

Phase 5: Polish & Production (Week 6)

  • Alert history and notification integration
  • Dark mode support (CSS variables)
  • Mobile-responsive layout (Tailwind optional)
  • Single binary build with embedded assets
  • Documentation and Makefile targets

Open Questions

  1. Should alerts be embedded in dashboard or use separate AlertManager?

    • Proposal: Dashboard shows alerts but doesn't manage them (SigNoz AlertManager owns alert logic)
    • Reasoning: Separation of concerns; dashboard is read-only view
  2. How to handle multi-proxy deployments (future)?

    • Proposal: Add proxy selector dropdown in UI, query metrics per proxy
    • Reasoning: Defer until multi-proxy is required (post-POC)
  3. Should dashboard persist historical data or rely on SigNoz?

    • Proposal: No persistence; query SigNoz for historical views
    • Reasoning: Avoid data duplication, SigNoz is source of truth
  4. Access control for dashboard?

    • Proposal: Phase 1 = no auth (local dev), Phase 2 = integrate with Dex OIDC
    • Reasoning: Focus on functionality first, add auth for production
  5. Mobile app or web-only?

    • Proposal: Web-only, responsive design for mobile browsers
    • Reasoning: Native mobile app is significant scope expansion

References

Dashboard Design Patterns

Technologies

Revision History

  • 2025-11-07: Initial draft with 8 prioritized views and technical architecture