RFC-050: Operations Dashboard (HUD)
Status: Draft | Author: Platform Team | Created: 2025-11-07 | Updated: 2025-11-07
Abstract
This RFC defines the Prism Operations Dashboard (internally called "HUD" - Heads-Up Display), a real-time web interface providing comprehensive visibility into Prism proxy health, pattern status, backend connectivity, and system performance.
Core Principle: Detect issues before users notice through proactive monitoring of high-probability failure indicators.
The dashboard provides:
- Real-time System Health: Overall status with drill-down to component details
- Performance Monitoring: Latency, throughput, and SLO compliance tracking
- Pattern & Backend Health: Lifecycle state and connection pool monitoring
- Messaging Visibility: PubSub message flow and delivery health
- Multi-Tenancy Insights: Per-namespace resource usage and error rates
- Security Monitoring: Authentication success rates and authorization metrics
- Observability Health: OpenTelemetry pipeline status
- Resource Utilization: System-level CPU, memory, and capacity tracking
Technology Stack: Go HTTP server (backend), HTMX + D3.js (frontend), WebSocket for real-time updates, Prometheus metrics + SigNoz traces. Framework-less approach with no build step required (see ADR-061).
Motivation
Problem Statement
Current Situation (Post-POC 1-3):
- ✅ Proxy spawns and manages patterns successfully
- ✅ Patterns communicate with backends (MemStore, Redis, NATS)
- ✅ OpenTelemetry traces sent to SigNoz
- ❌ No unified operational view: Developers must check:
  - SigNoz UI for traces
  - Docker logs for pattern health
  - `curl` requests for metrics
  - Manual gRPC health checks
  - Backend-specific tools (redis-cli, nats-cli)
Pain Points:
- Reactive Debugging: Issues discovered after failures occur
- Scattered Information: No single pane of glass for system health
- Slow MTTR: Mean Time To Resolution is high due to information-gathering overhead
- No SLO Visibility: Can't track whether latency/reliability targets are being met
- Pattern Health Blind Spots: Process crashes/restarts go unnoticed
- Backend Issues Surface Late: Connection pool exhaustion detected only after timeouts
Goals
- Proactive Issue Detection: Catch problems before user impact (target: 95% issues detected within 30s)
- Single Pane of Glass: All critical system health in one view
- Fast MTTR: Reduce time to identify root cause from minutes to seconds
- SLO Tracking: Visualize compliance with latency/reliability targets
- Developer Productivity: Eliminate manual metric gathering during debugging
- Production Readiness: Dashboard suitable for both local dev and production
Non-Goals
- Application Metrics: Dashboard focuses on Prism infrastructure, not application-level business metrics
- Log Aggregation: Logs remain in SigNoz; dashboard shows summary insights only
- Alerting Engine: Dashboard surfaces metrics but doesn't replace alert manager (use SigNoz alerts)
- Historical Analysis: Focus on real-time (last 24h), not long-term trends (use SigNoz for that)
- Multi-Cluster Management: Single cluster/proxy instance view (multi-cluster is future work)
Architecture Overview
System Context
Data Flow
- Proxy & Patterns: Emit metrics (Prometheus format) + traces (OTLP)
- Dashboard Backend: Aggregates data from multiple sources every 5 seconds
- WebSocket Push: Real-time updates to browser clients
- Frontend Rendering: HTMX + D3.js display live data with <2s latency
Technology Stack
See ADR-061: Framework-Less Web UI for complete rationale.
Backend (dashboard/):
- Language: Go
- HTTP Server: `net/http` standard library + `gorilla/mux`
- Templates: Go `html/template` (server-side rendering)
- Metrics Scraping: Native Prometheus text format parser
- gRPC Client: Native Go gRPC client (call pattern health checks)
- SigNoz Integration: `net/http` client (query SigNoz API)
- WebSocket: `gorilla/websocket` (push updates to frontend)
- Caching: Optional in-memory cache (reduce SigNoz query load)
Frontend (embedded in dashboard/static/ and dashboard/templates/):
- HTML: Go templates (server-rendered, type-safe)
- Interactivity: HTMX 1.9+ (14KB, replaces React)
- Visualization: D3.js v7 (~70KB, best-in-class charting)
- Diagrams: Mermaid.js (~200KB, text-to-diagram)
- Styling: Plain CSS or Tailwind (optional)
- WebSocket Client: Native browser WebSocket API
- Build Step: NONE - no npm, no webpack, instant reload
Deployment:
- Binary: Single Go executable (`go build`)
- Assets: Embedded with `//go:embed` directive (templates + static files)
- Development: `make dashboard-run` starts server (instant reload with `air`)
- Production: Single binary or Docker container (no Node.js runtime needed)
Dashboard Views
View Hierarchy
The dashboard uses a hub-and-spoke model with a primary System Health view and drill-down panels:
```text
┌─────────────────────────────────────────────────────────┐
│ System Health (Hub) │ ← Always visible
│ Overall status, critical metrics, pattern grid │
├─────────────────────────────────────────────────────────┤
│ Drill-Down Panels (Spoke) │ ← Accessed via tabs/clicks
│ ┌──────────────┬──────────────┬──────────────┐ │
│ │ Performance │ Backend │ Messaging │ │
│ │ Monitoring │ Health │ Flow │ │
│ ├──────────────┼──────────────┼──────────────┤ │
│ │ Multi- │ Security │ Observability│ │
│ │ Tenancy │ Monitoring │ Health │ │
│ ├──────────────┼──────────────┼──────────────┤ │
│ │ System │ Alerts │ Settings │ │
│ │ Resources │ History │ │ │
│ └──────────────┴──────────────┴──────────────┘ │
└─────────────────────────────────────────────────────────┘
```
View 1: System Health (Primary View)
Priority: CRITICAL (Always visible) | Detection Probability: 99% | Update Frequency: 2 seconds
Purpose: Single-glance answer to "Is Prism working?"
Layout
```text
┌────────────────────────────────────────────────────────────────┐
│ Prism Operations Dashboard [Auto-refresh: 2s]│
├────────────────────────────────────────────────────────────────┤
│ │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ 🟢 SYSTEM HEALTHY ││
│ │ ││
│ │ ✅ 99.97% Success Rate 📊 8,432 RPS ⚡ 0.8ms P99 ││
│ │ 🔒 99.8% Auth Success 🔗 3 Patterns 📦 15 Namespaces ││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ Pattern Health Grid │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Pattern Status Phase Uptime Restarts │ │
│ ├──────────────────────────────────────────────────────────┤ │
│ │ MemStore 🟢 HEALTHY Running 4d 3h 0 │ │
│ │ Redis 🟢 HEALTHY Running 4d 3h 0 │ │
│ │ NATS 🟡 DEGRADED Running 2h 15m 2 │◄─┐
│ │ PostgreSQL 🟢 HEALTHY Running 4d 3h 0 │ │
│ └──────────────────────────────────────────────────────────┘ │
│ [Click row for details] ──┘
│ │
├────────────────────────────────────────────────────────────────┤
│ Critical Metrics (Last 5 minutes) │
│ ┌────────────┬────────────┬────────────┬────────────────────┐ │
│ │ Latency │ Throughput │ Error Rate │ Backend Pools │ │
│ ├────────────┼────────────┼────────────┼────────────────────┤ │
│ │ P50: 0.4ms │ Read: 5.2k│ 0.03% │ Redis: 7/10 🟢 │ │
│ │ P99: 0.8ms │ Write: 3.2k│ 25 errors │ NATS: Connected 🟡 │ │
│ │ P999: 2.1ms│ Total: 8.4k│ /min │ PG: 3/20 🟢 │ │
│ │ │ │ │ │ │
│ │ [Chart] │ [Chart] │ [Chart] │ [Status Grid] │ │
│ └────────────┴────────────┴────────────┴────────────────────┘ │
│ │
├────────────────────────────────────────────────────────────────┤
│ Recent Alerts │
│ 🟡 2m ago: NATS pattern restarted (restart loop detected) │
│ 🟢 15m ago: Redis pool capacity >90% (now resolved) │
│ [View All Alerts →] │
└────────────────────────────────────────────────────────────────┘
```
Components
1. System Status Banner
- Overall Health: Computed from all patterns (GREEN if all healthy, YELLOW if any degraded, RED if any unhealthy)
- Success Rate: `(successful_requests / total_requests) * 100` (last 5 min)
- RPS: Current requests per second (read + write)
- P99 Latency: 99th percentile latency (last 5 min)
- Auth Success: Authentication success rate
- Active Patterns: Count of patterns in HEALTHY or DEGRADED state
- Namespaces: Total active namespaces
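
The Overall Health roll-up described above is a worst-state-wins fold over the pattern list. A minimal sketch, assuming the gRPC health collector exposes a struct like `PatternHealth` (the type and field names are illustrative):

```go
// dashboard/aggregators/system_health.go (sketch)
package aggregators

// PatternHealth is the per-pattern view assumed by this sketch; the real
// struct comes from the gRPC health collector.
type PatternHealth struct {
    Name     string
    Status   string // "HEALTHY" | "DEGRADED" | "UNHEALTHY"
    Phase    string // spawn → connect → initialize → start → running
    Restarts int    // restarts in the last 24h
}

// computeOverallHealth is worst-state-wins: RED if any pattern is unhealthy,
// YELLOW if any is degraded, GREEN otherwise.
func computeOverallHealth(patterns []PatternHealth) string {
    overall := "HEALTHY"
    for _, p := range patterns {
        switch p.Status {
        case "UNHEALTHY":
            return "UNHEALTHY" // one red pattern makes the banner red
        case "DEGRADED":
            overall = "DEGRADED" // remember yellow, keep scanning for red
        }
    }
    return overall
}
```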
2. Pattern Health Grid
- Status: 🟢 HEALTHY / 🟡 DEGRADED / 🔴 UNHEALTHY (from gRPC HealthCheck)
- Phase: spawn → connect → initialize → start → running
- Uptime: Time since last start
- Restarts: Restart count in last 24h
- Click Action: Drill down to pattern detail view
3. Critical Metrics Charts
- Latency Chart: Line chart showing P50/P99/P999 over last 5 minutes
- Throughput Chart: Stacked area chart (read vs write RPS)
- Error Rate Chart: Bar chart of errors per minute
- Backend Pools: Connection pool status for each backend
4. Recent Alerts
- Last 5 alerts/warnings with timestamp and auto-resolution status
- Color-coded by severity: 🔴 Critical / 🟡 Warning / 🟢 Resolved
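
The restart-loop warning shown above can be derived directly from the pattern grid data. A hedged sketch of one such rule, reusing the illustrative `PatternHealth` type from the previous sketch (the two-restart threshold is a placeholder, not an agreed value):

```go
// dashboard/aggregators/alerts.go (sketch)
package aggregators

import (
    "fmt"
    "time"
)

// Alert feeds the Recent Alerts list; Severity drives the color coding.
type Alert struct {
    Severity  string // "critical" | "warning" | "resolved"
    Message   string
    Timestamp time.Time
}

// evaluatePatternAlerts flags restart loops: two or more restarts inside the
// 24h window produce a warning. The threshold is illustrative.
func evaluatePatternAlerts(patterns []PatternHealth) []Alert {
    var alerts []Alert
    for _, p := range patterns {
        if p.Restarts >= 2 {
            alerts = append(alerts, Alert{
                Severity:  "warning",
                Message:   fmt.Sprintf("%s pattern restarted (restart loop detected)", p.Name),
                Timestamp: time.Now(),
            })
        }
    }
    return alerts
}
```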
Data Sources
```go
// dashboard/handlers/system_health.go
func SystemHealthHandler(w http.ResponseWriter, r *http.Request) {
    // Aggregate system health from all data sources.

    // 1. Scrape proxy metrics (Prometheus)
    proxyMetrics, err := scrapePrometheus("http://localhost:8980/metrics")
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }

    // 2. Query pattern health checks (gRPC)
    patternHealth, err := queryAllPatternsHealth(r.Context())
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }

    // 3. Query SigNoz for latency percentiles
    latency, err := querySigNozLatency("5m")
    if err != nil {
        log.Printf("SigNoz query error: %v", err)
        // Continue with empty latency data
    }

    // 4. Query admin API for namespace count
    namespaces, err := adminClient.ListNamespaces(r.Context())
    if err != nil {
        log.Printf("Admin API error: %v", err)
    }

    data := SystemHealthData{
        OverallStatus:   computeOverallHealth(patternHealth),
        SuccessRate:     proxyMetrics.SuccessRate,
        RPS:             proxyMetrics.RequestsPerSecond,
        LatencyP99:      latency.P99,
        AuthSuccessRate: proxyMetrics.AuthSuccessRate,
        Patterns:        patternHealth,
        NamespaceCount:  len(namespaces),
        Timestamp:       time.Now(),
    }

    // Render the Go template; log (don't crash) on template errors.
    if err := templates.ExecuteTemplate(w, "dashboard.html", data); err != nil {
        log.Printf("template error: %v", err)
    }
}
```
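
The `scrapePrometheus` call above is the "native Prometheus text format parser" from the Technology Stack. A minimal sketch of that collector, assuming the standard text exposition format (the `Sample` type and function names are illustrative, not settled API):

```go
// dashboard/collectors/prometheus.go (sketch)
package collectors

import (
    "bufio"
    "fmt"
    "net/http"
    "strconv"
    "strings"
)

// Sample is one parsed metric line, e.g. prism_requests_total{namespace="x"} 42.
type Sample struct {
    Name   string
    Labels string // raw label block, e.g. namespace="x",operation="get"
    Value  float64
}

// ScrapePrometheus fetches a /metrics endpoint and parses the text exposition
// format. Only the common `name{labels} value` shape is handled; exemplars and
// escaped label values are out of scope for this sketch.
func ScrapePrometheus(url string) ([]Sample, error) {
    resp, err := http.Get(url)
    if err != nil {
        return nil, fmt.Errorf("scrape %s: %w", url, err)
    }
    defer resp.Body.Close()

    var samples []Sample
    scanner := bufio.NewScanner(resp.Body)
    for scanner.Scan() {
        line := strings.TrimSpace(scanner.Text())
        if line == "" || strings.HasPrefix(line, "#") {
            continue // skip HELP/TYPE comments and blank lines
        }
        var name, labels, rest string
        if open := strings.IndexByte(line, '{'); open >= 0 {
            end := strings.IndexByte(line, '}')
            if end < open {
                continue // malformed label block
            }
            name, labels = line[:open], line[open+1:end]
            rest = line[end+1:]
        } else {
            parts := strings.SplitN(line, " ", 2)
            if len(parts) != 2 {
                continue
            }
            name, rest = parts[0], parts[1]
        }
        fields := strings.Fields(rest) // value, optional timestamp
        if len(fields) == 0 {
            continue
        }
        value, err := strconv.ParseFloat(fields[0], 64)
        if err != nil {
            continue
        }
        samples = append(samples, Sample{Name: name, Labels: labels, Value: value})
    }
    return samples, scanner.Err()
}
```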
WebSocket Push
Client-side (in Go template):
```html
<!-- dashboard/templates/dashboard.html -->
<script>
  // Native WebSocket API (no React hooks needed)
  const ws = new WebSocket('ws://' + location.host + '/ws/system-health');

  ws.onmessage = (event) => {
    const data = JSON.parse(event.data);

    // Update status banner
    document.getElementById('success-rate').textContent = data.success_rate.toFixed(2) + '%';
    document.getElementById('rps').textContent = data.rps;
    document.getElementById('latency-p99').textContent = data.latency_p99.toFixed(1) + 'ms';

    // Update charts with D3.js
    updateLatencyChart(data.latency_history);
    updateThroughputChart(data.throughput_history);

    // Update last refresh time
    document.getElementById('last-update').textContent = new Date().toLocaleTimeString();
  };

  ws.onerror = (error) => {
    console.error('WebSocket error:', error);
    document.getElementById('status-indicator').className = 'status-error';
  };
</script>
```
View 2: Performance Monitoring
Priority: HIGH | Detection Probability: 95% | Update Frequency: 5 seconds
Purpose: Detailed latency, throughput, and SLO compliance tracking
Layout
```text
┌────────────────────────────────────────────────────────────────┐
│ Performance Monitoring │
├────────────────────────────────────────────────────────────────┤
│ │
│ SLO Compliance │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Target: 99.9% requests < 10ms (P99) ││
│ │ Current: 99.7% ✅ ││
│ │ [━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━] 99.7% ││
│ │ ││
│ │ Last 24h: 99.8% ✅ | Last 7d: 99.9% ✅ ││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ Latency Breakdown (Last 1 hour) │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Operation P50 P99 P999 Max Count ││
│ ├────────────────────────────────────────────────────────────┤│
│ │ KeyValue.Set 0.3ms 0.7ms 1.2ms 4.5ms 125k ││
│ │ KeyValue.Get 0.2ms 0.5ms 0.9ms 3.1ms 287k ││
│ │ PubSub.Pub 0.4ms 0.9ms 2.1ms 8.7ms 42k ││
│ │ PubSub.Sub 0.8ms 2.3ms 5.4ms 15.2ms 18k ││
│ └────────────────────────────────────────────────────────────┘│
│ │
│ Latency Heatmap (Time vs Percentile) │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ 10ms ┤ ▂▃ ││
│ │ 5ms ┤ ▁▂▃▄▅▆▇███ ││
│ │ 1ms ┤ ▁▂▃▄▅▆▇███ ││
│ │ 0.5ms┤ ▁▂▃▄▅▆▇████████ ││
│ │ 0.1ms┤▁▂▃▄▅▆▇█████ ││
│ │ └────────────────────────────────────────────────────┘││
│ │ 12:00 12:15 12:30 12:45 13:00 13:15 13:30 ││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ Throughput by Pattern │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ [Stacked Area Chart] ││
│ │ 10k ┤ ███ ││
│ │ 8k ┤ ▄▄▄▄▄███ ││
│ │ 6k ┤ ▃▃▃▃▃█████████ ││
│ │ 4k ┤ ▂▂▂▂▂▂██████████████ ││
│ │ 2k ┤ ▁▁▁▁▁▁▁▁▁▁████████████████████ ││
│ │ └────────────────────────────────────────────────────┘││
│ │ MemStore ▀▀▀ Redis ▀▀▀ NATS ▀▀▀ PostgreSQL ▀▀▀ ││
│ └────────────────────────────────────────────────────────────┘│
│ │
└────────────────────────────────────────────────────────────────┘
```
Key Features
- SLO Compliance Tracker: Visual progress bar showing % of requests meeting latency target
- Operation-Level Latency: Breakdown by operation type (Set, Get, Publish, Subscribe)
- Latency Heatmap: Visualize latency distribution over time (identify spikes)
- Throughput by Pattern: See which patterns are handling most traffic
Data Sources
- Latency Percentiles: Query SigNoz traces aggregated by operation
- SLO Compliance: Count requests with latency < 10ms / total requests
- Throughput: Scrape Prometheus metrics from proxy (`prism_requests_total` counter)
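
The SLO arithmetic itself is a ratio of cumulative histogram counters. A sketch, assuming the proxy exports a standard Prometheus histogram (metric names such as `prism_request_duration_seconds_bucket` are assumptions):

```go
// dashboard/aggregators/performance.go (sketch)
package aggregators

// SLOCompliance computes the percentage of requests meeting the latency
// target from cumulative histogram counters. With a 10ms target this is
// bucket{le="0.01"} / count * 100, since Prometheus buckets are cumulative.
func SLOCompliance(bucketUnderTarget, totalCount float64) float64 {
    if totalCount == 0 {
        return 100.0 // no traffic: report compliant rather than divide by zero
    }
    return bucketUnderTarget / totalCount * 100.0
}
```

A reading of 99.7% in the compliance bar above means the under-10ms bucket held 0.997 of the window's request count.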
View 3: Backend Health
Priority: HIGH | Detection Probability: 92% | Update Frequency: 5 seconds
Purpose: Connection pool health and backend connectivity status
Layout
```text
┌────────────────────────────────────────────────────────────────┐
│ Backend Health │
├────────────────────────────────────────────────────────────────┤
│ │
│ Connection Pools │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Backend Type Active Idle Max Util Status ││
│ ├────────────────────────────────────────────────────────────┤│
│ │ Redis KeyValue 7 3 10 70% 🟢 HEALTHY ││
│ │ PubSub 2 8 10 20% 🟢 HEALTHY ││
│ │ NATS PubSub 1 0 1 100% 🟡 DEGRADED ││
│ │ PostgreSQL KeyValue 3 17 20 15% 🟢 HEALTHY ││
│ │ Queue 5 15 20 25% 🟢 HEALTHY ││
│ │ MemStore KeyValue N/A N/A N/A N/A 🟢 HEALTHY ││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ Connection Metrics (Last 5 minutes) │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Redis ││
│ │ ├─ Connections: [Time series chart] ││
│ │ ├─ Acquisition Time: Avg 2.1ms P99 8.3ms ││
│ │ ├─ Errors: 0 refused, 0 timeout, 0 reset ││
│ │ └─ Pool Capacity: [Progress bar] 70% ││
│ │ ││
│ │ NATS ││
│ │ ├─ Connection State: CONNECTED (reconnects: 2) ││
│ │ ├─ Subscriptions: 127 active ││
│ │ ├─ Stats: In: 42k msgs (8.4 MB) Out: 18k msgs (1.2 MB) ││
│ │ └─ Pending: 0 messages ││
│ │ ││
│ │ PostgreSQL ││
│ │ ├─ Connections: 8/20 active ││
│ │ ├─ Active Queries: 3 ││
│ │ ├─ Query Duration: Avg 12ms P99 45ms ││
│ │ └─ Pool Wait Time: Avg 0.3ms ││
│ └────────────────────────────────────────────────────────────┘│
│ │
└────────────────────────────────────────────────────────────────┘
```
Key Features
- Connection Pool Table: Shows active/idle/max connections per backend
- Utilization Tracking: Visual indicator when approaching capacity (>90% = yellow)
- Connection Acquisition Time: How long to get a connection from pool
- Error Tracking: Connection refused, timeout, reset counts
- Backend-Specific Metrics:
  - Redis: Pool stats from `PoolStats()`
  - NATS: Connection state, subscription count, message stats
  - PostgreSQL: Active queries, query duration
Data Sources
```go
// patterns/redis/plugin.go
func (r *RedisPlugin) HealthCheck(ctx context.Context) *HealthCheckResponse {
    stats := r.client.PoolStats()
    return &HealthCheckResponse{
        Status: computeStatus(stats),
        Metadata: map[string]string{
            "connections_active": fmt.Sprintf("%d", stats.TotalConns-stats.IdleConns),
            "connections_idle":   fmt.Sprintf("%d", stats.IdleConns),
            "pool_size":          fmt.Sprintf("%d", r.config.PoolSize),
            "utilization_pct":    fmt.Sprintf("%.1f", float64(stats.TotalConns-stats.IdleConns)/float64(r.config.PoolSize)*100),
        },
    }
}
```
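
`computeStatus` is referenced above but not defined in this RFC. One plausible mapping, using the >90% utilization threshold from the Key Features list (this variant takes the configured pool size as a parameter, a small departure from the call site above, and assumes the go-redis client):

```go
// patterns/redis/plugin.go (sketch of the status mapping)

import "github.com/redis/go-redis/v9" // assumption: adjust to the plugin's actual client

type HealthStatus int

const (
    HEALTHY HealthStatus = iota
    DEGRADED
    UNHEALTHY
)

// computeStatus maps connection-pool pressure to a health status: >90%
// utilization turns the row yellow, full exhaustion turns it red.
func computeStatus(stats *redis.PoolStats, poolSize int) HealthStatus {
    if poolSize == 0 {
        return UNHEALTHY
    }
    active := int(stats.TotalConns) - int(stats.IdleConns)
    utilization := float64(active) / float64(poolSize)
    switch {
    case utilization >= 1.0:
        return UNHEALTHY // pool exhausted: acquisitions will queue or time out
    case utilization > 0.90:
        return DEGRADED
    default:
        return HEALTHY
    }
}
```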
View 4: Messaging Flow (PubSub)
Priority: MEDIUM-HIGH | Detection Probability: 88% | Update Frequency: 5 seconds
Purpose: PubSub message delivery health and subscriber tracking
Layout
```text
┌────────────────────────────────────────────────────────────────┐
│ Messaging Flow (PubSub) │
├────────────────────────────────────────────────────────────────┤
│ │
│ Message Flow Health │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Published: 42,187 msgs | Delivered: 126,561 msgs ✅ ││
│ │ Dropped: 12 msgs (0.03%) | Pending: 0 msgs ││
│ │ ││
│ │ Delivery Ratio: 3.0x (fanout working correctly) ││
│ │ Delivery Latency: P99 2.3ms ✅ ││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ Active Topics & Subscribers │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Topic Subs Pub/sec Del/sec Latency Status ││
│ ├────────────────────────────────────────────────────────────┤│
│ │ events.user 3 142 426 1.2ms 🟢 ││
│ │ events.system 5 87 435 0.9ms 🟢 ││
│ │ logs.application 1 523 523 0.4ms 🟢 ││
│ │ alerts.critical 12 2 24 15.2ms 🟡 ││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ Message Timeline (Last 5 minutes) │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Published ▀▀▀ ││
│ │ Delivered ▀▀▀ ││
│ │ Dropped ▀▀▀ ││
│ │ ││
│ │ [Multi-line chart showing published vs delivered vs dropped]││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ Subscriber Details │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Subscriber ID Topic Msgs Recv Lag ││
│ ├────────────────────────────────────────────────────────────┤│
│ │ sub-worker-1 events.user 14,235 0ms ││
│ │ sub-worker-2 events.user 14,190 0ms ││
│ │ sub-worker-3 events.user 14,201 0ms ││
│ │ sub-analytics events.system 8,745 0ms ││
│ │ sub-logger logs.application 52,301 125ms 🟡 ││
│ └────────────────────────────────────────────────────────────┘│
│ │
└────────────────────────────────────────────────────────────────┘
```
Key Features
- Message Flow Summary: Published vs Delivered (should be N:1 for fanout)
- Dropped Messages: Count and percentage (should be near zero)
- Delivery Latency: Time from publish to subscriber receive
- Topic Breakdown: Per-topic subscriber count and throughput
- Subscriber Lag: Identify slow subscribers (lag >100ms = warning)
Data Sources
```go
// patterns/nats/plugin.go
func (n *NATSPlugin) HealthCheck(ctx context.Context) *HealthCheckResponse {
    stats := n.conn.Stats()
    return &HealthCheckResponse{
        Status: HEALTHY,
        Metadata: map[string]string{
            "subscription_count": fmt.Sprintf("%d", len(n.subscriptions)),
            "in_msgs":            fmt.Sprintf("%d", stats.InMsgs),
            "out_msgs":           fmt.Sprintf("%d", stats.OutMsgs),
            "in_bytes":           fmt.Sprintf("%d", stats.InBytes),
            "out_bytes":          fmt.Sprintf("%d", stats.OutBytes),
            "dropped_msgs":       fmt.Sprintf("%d", n.droppedMessageCount),
        },
    }
}
```
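
The flow-summary numbers in the banner (Published 42,187 → Delivered 126,561 ≈ 3.0x) reduce to two ratios over counters like those exposed above. A sketch, with an illustrative struct:

```go
// dashboard/aggregators/messaging.go (sketch)
package aggregators

// MessageFlow summarizes PubSub health from pattern counters.
type MessageFlow struct {
    Published, Delivered, Dropped uint64
}

// DeliveryRatio is delivered/published; with fanout it should roughly equal
// the average subscriber count per topic (e.g. 3.0x in the view above).
func (f MessageFlow) DeliveryRatio() float64 {
    if f.Published == 0 {
        return 0
    }
    return float64(f.Delivered) / float64(f.Published)
}

// DropRate should stay near zero; the dashboard flags it when it exceeds a
// small threshold (the exact threshold is left open here).
func (f MessageFlow) DropRate() float64 {
    total := f.Delivered + f.Dropped
    if total == 0 {
        return 0
    }
    return float64(f.Dropped) / float64(total) * 100.0
}
```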
View 5: Multi-Tenancy (Namespaces)
Priority: MEDIUM | Detection Probability: 85% | Update Frequency: 10 seconds
Purpose: Per-namespace resource usage and error tracking
Layout
```text
┌────────────────────────────────────────────────────────────────┐
│ Multi-Tenancy (Namespaces) │
├────────────────────────────────────────────────────────────────┤
│ │
│ Active Namespaces (15 total) │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Namespace RPS Latency Errors Patterns Status ││
│ ├────────────────────────────────────────────────────────────┤│
│ │ user-platform 4,231 0.8ms 0.02% KV, PS 🟢 ││
│ │ payments 1,847 1.2ms 0.01% KV, Q 🟢 ││
│ │ analytics 892 2.3ms 0.05% PS, TS 🟢 ││
│ │ notifications 645 0.9ms 1.2% PS 🟡 ││
│ │ search-index 387 5.4ms 0.03% KV, G 🟢 ││
│ │ ... (10 more) ... ... ... ... ... ││
│ └────────────────────────────────────────────────────────────┘│
│ [Show All →] │
│ │
├────────────────────────────────────────────────────────────────┤
│ Traffic Distribution (Top 10 by RPS) │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ [Pie Chart] ││
│ │ ││
│ │ user-platform: 50.2% ││
│ │ payments: 21.9% ││
│ │ analytics: 10.6% ││
│ │ notifications: 7.6% ││
│ │ others: 9.7% ││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ Namespace Details: user-platform │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Configuration ││
│ │ ├─ Patterns: KeyValue (Redis), PubSub (NATS) ││
│ │ ├─ Created: 2025-10-15 ││
│ │ ├─ Owner: team-user-platform@example.com ││
│ │ └─ Max RPS: 10,000 (current: 42%) ││
│ │ ││
│ │ Performance (Last 1 hour) ││
│ │ ├─ P99 Latency: [Chart] ││
│ │ ├─ Throughput: [Chart] ││
│ │ └─ Error Rate: [Chart] ││
│ │ ││
│ │ Top Operations ││
│ │ ├─ KeyValue.Get: 3,201 RPS ││
│ │ ├─ KeyValue.Set: 987 RPS ││
│ │ └─ PubSub.Publish: 43 RPS ││
│ └────────────────────────────────────────────────────────────┘│
│ │
└────────────────────────────────────────────────────────────────┘
```
Key Features
- Namespace Table: RPS, latency, error rate per namespace
- Traffic Distribution: Identify noisy neighbors (>80% of traffic)
- Namespace Drill-Down: Detailed metrics for selected namespace
- Capacity Tracking: Show RPS vs max configured capacity
- Pattern Usage: Which patterns each namespace uses
Data Sources
- Admin API: Query namespace configurations
- SigNoz Traces: Filter by namespace tag for per-namespace metrics
- Proxy Metrics: Scrape `prism_requests_total{namespace="..."}` and group by the namespace label
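
A sketch of the per-namespace roll-up over scraped counter samples, reusing the illustrative `Sample` type from the collector sketch in View 1 (label parsing is simplified; quoted commas inside label values are not handled):

```go
// dashboard/aggregators/namespaces.go (sketch)
package aggregators

import "strings"

// RequestsByNamespace sums prism_requests_total counters by their namespace
// label. Computing RPS then takes the delta between two consecutive scrapes
// divided by the scrape interval.
func RequestsByNamespace(samples []Sample) map[string]float64 {
    totals := make(map[string]float64)
    for _, s := range samples {
        if s.Name != "prism_requests_total" {
            continue
        }
        ns := labelValue(s.Labels, "namespace")
        if ns != "" {
            totals[ns] += s.Value
        }
    }
    return totals
}

// labelValue pulls one label out of a raw label block like
// namespace="payments",operation="get".
func labelValue(labels, key string) string {
    for _, pair := range strings.Split(labels, ",") {
        kv := strings.SplitN(pair, "=", 2)
        if len(kv) == 2 && strings.TrimSpace(kv[0]) == key {
            return strings.Trim(kv[1], `"`)
        }
    }
    return ""
}
```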
View 6: Security Monitoring
Priority: MEDIUM | Detection Probability: 80% | Update Frequency: 10 seconds
Purpose: Authentication and authorization tracking
Layout
```text
┌────────────────────────────────────────────────────────────────┐
│ Security Monitoring │
├────────────────────────────────────────────────────────────────┤
│ │
│ Authentication Health │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ JWT Validation Success: 99.8% ✅ ││
│ │ Token Refreshes: 127 (last 1h) ││
│ │ Failed Attempts: 8 (last 1h) ││
│ │ Dex Connectivity: 🟢 CONNECTED ││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ Failed Authentication Attempts (Last 24 hours) │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Time Reason Source IP User ││
│ ├────────────────────────────────────────────────────────────┤│
│ │ 13:24:15 Token expired 10.0.1.45 alice@ ││
│ │ 13:18:42 Invalid signature 10.0.2.12 unknown ││
│ │ 12:45:33 Token expired 10.0.1.45 alice@ ││
│ │ 11:32:18 Missing token 10.0.3.88 unknown ││
│ │ 11:15:07 Invalid issuer 10.0.2.99 unknown ││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ Authorization (Last 1 hour) │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Total Requests: 508,432 ││
│ │ Authorized: 508,401 (99.99%) ││
│ │ Denied: 31 (0.01%) ││
│ │ ││
│ │ Denial Reasons: ││
│ │ ├─ Namespace access denied: 18 ││
│ │ ├─ Pattern not allowed: 8 ││
│ │ ├─ Rate limit exceeded: 5 ││
│ │ └─ Invalid operation: 0 ││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ JWKS Cache │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Cache Hits: 508,401 (99.99%) ││
│ │ Cache Misses: 31 (0.01%) ││
│ │ Last Refresh: 13:15:42 (15 minutes ago) ││
│ │ Next Refresh: 13:45:42 (in 15 minutes) ││
│ └────────────────────────────────────────────────────────────┘│
│ │
└────────────────────────────────────────────────────────────────┘
```
Key Features
- Auth Success Rate: Should be >99% (excludes expected token expirations)
- Failed Attempts Log: Investigate suspicious patterns (same IP, repeated failures)
- Authorization Denials: Track why requests are denied
- JWKS Cache Health: Ensure public key cache is working
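
A sketch of how the failed-attempts table could be backed on the dashboard side: a bounded, thread-safe in-memory log (the bound and field names are illustrative):

```go
// dashboard/aggregators/security.go (sketch)
package aggregators

import (
    "sync"
    "time"
)

type AuthFailure struct {
    Time     time.Time
    Reason   string // e.g. "Token expired", "Invalid signature"
    SourceIP string
    User     string
}

// AuthLog keeps the most recent failures for the dashboard table; older
// entries age out so memory stays bounded.
type AuthLog struct {
    mu       sync.Mutex
    failures []AuthFailure
    max      int
}

func NewAuthLog(max int) *AuthLog { return &AuthLog{max: max} }

func (l *AuthLog) Record(f AuthFailure) {
    l.mu.Lock()
    defer l.mu.Unlock()
    l.failures = append(l.failures, f)
    if len(l.failures) > l.max {
        l.failures = l.failures[len(l.failures)-l.max:] // keep newest N
    }
}

// Recent returns failures within the window (e.g. 24h for the table above).
func (l *AuthLog) Recent(window time.Duration) []AuthFailure {
    l.mu.Lock()
    defer l.mu.Unlock()
    cutoff := time.Now().Add(-window)
    var out []AuthFailure
    for _, f := range l.failures {
        if f.Time.After(cutoff) {
            out = append(out, f)
        }
    }
    return out
}
```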
View 7: Observability Health
Priority: MEDIUM-LOW | Detection Probability: 75% | Update Frequency: 30 seconds
Purpose: Verify OpenTelemetry pipeline is working
Layout
```text
┌────────────────────────────────────────────────────────────────┐
│ Observability Health │
├────────────────────────────────────────────────────────────────┤
│ │
│ OpenTelemetry Pipeline Status │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Component Status Last Exported ││
│ ├────────────────────────────────────────────────────────────┤│
│ │ Prism Proxy → SigNoz 🟢 HEALTHY 2 seconds ago ││
│ │ Patterns → SigNoz 🟢 HEALTHY 3 seconds ago ││
│ │ OTLP Collector 🟢 HEALTHY 1 second ago ││
│ │ SigNoz (Query Service) 🟢 HEALTHY 5 seconds ago ││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ Export Metrics (Last 5 minutes) │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Traces Exported: 12,487 ││
│ │ Metrics Exported: 52,301 ││
│ │ Logs Exported: 8,945 ││
│ │ ││
│ │ Export Errors: 3 (0.02%) ││
│ │ Export Latency: P99 12ms ││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ Trace Coverage │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Traces with All Spans: 12,401 (99.3%) ││
│ │ Traces Missing Spans: 86 (0.7%) ││
│ │ ││
│ │ Missing Spans Breakdown: ││
│ │ ├─ Backend span missing: 45 ││
│ │ ├─ Pattern span missing: 31 ││
│ │ └─ Proxy span missing: 10 ││
│ └────────────────────────────────────────────────────────────┘│
│ │
└────────────────────────────────────────────────────────────────┘
```
Key Features
- Pipeline Status: Verify all components exporting telemetry
- Export Metrics: Count of traces/metrics/logs exported
- Trace Coverage: Identify missing spans (should have proxy + pattern + backend)
View 8: System Resources
Priority: LOW | Detection Probability: 70% | Update Frequency: 10 seconds
Purpose: CPU, memory, and capacity tracking
Layout
```text
┌────────────────────────────────────────────────────────────────┐
│ System Resources │
├────────────────────────────────────────────────────────────────┤
│ │
│ Prism Proxy │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Memory: 287 MB / 500 MB (57%) [████████▌ ] ││
│ │ CPU: 12% (0.48 cores) ││
│ │ Threads: 24 ││
│ │ File Descriptors: 156 / 1024 (15%) ││
│ │ Uptime: 4d 3h 24m ││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ Pattern Processes │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Pattern Memory CPU Goroutines Uptime ││
│ ├────────────────────────────────────────────────────────────┤│
│ │ MemStore 12 MB 2% 8 4d 3h ││
│ │ Redis 45 MB 5% 12 4d 3h ││
│ │ NATS 38 MB 8% 15 2h 15m ││
│ │ PostgreSQL 67 MB 3% 20 4d 3h ││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ System Totals │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Total Memory: 449 MB ││
│ │ Total CPU: 30% (1.2 cores) ││
│ │ Total Processes: 5 ││
│ └────────────────────────────────────────────────────────────┘│
│ │
└────────────────────────────────────────────────────────────────┘
```
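
The per-pattern rows above (memory, goroutines, uptime) can be self-reported from the Go runtime by each pattern process. A sketch of the metadata a pattern might attach to its health response; CPU needs OS-level sampling (e.g. /proc) and is omitted here:

```go
// patterns/shared/resources.go (sketch)
package shared

import (
    "fmt"
    "runtime"
    "time"
)

// ResourceMetadata reports process-level stats from the Go runtime, in the
// same string-map shape as the health-check metadata used elsewhere.
func ResourceMetadata(start time.Time) map[string]string {
    var m runtime.MemStats
    runtime.ReadMemStats(&m)
    return map[string]string{
        "memory_bytes": fmt.Sprintf("%d", m.Alloc), // live heap allocation
        "goroutines":   fmt.Sprintf("%d", runtime.NumGoroutine()),
        "uptime":       time.Since(start).Truncate(time.Second).String(),
    }
}
```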
Technical Implementation
See ADR-061: Framework-Less Web UI for complete implementation details and code examples.
This section provides a high-level overview. ADR-061 contains comprehensive Go code examples, HTMX patterns, and D3.js visualization code.
Backend Architecture
```text
dashboard/
├── main.go # Go HTTP server entry point
├── config.go # Configuration (env vars)
├── handlers/
│ ├── dashboard.go # View 1 (System Health)
│ ├── performance.go # View 2
│ ├── backends.go # View 3
│ ├── messaging.go # View 4
│ ├── namespaces.go # View 5
│ ├── security.go # View 6
│ ├── observability.go # View 7
│ ├── resources.go # View 8
│ ├── api.go # JSON API for HTMX
│ └── websocket.go # WebSocket hub
├── collectors/
│ ├── prometheus.go # Scrape proxy metrics
│ ├── grpc_health.go # Query pattern health checks
│ ├── signoz.go # Query SigNoz API
│ └── admin.go # Query Admin API
├── aggregators/
│ ├── system_health.go # Aggregate system health
│ ├── performance.go # Compute percentiles, SLO
│ └── messaging.go # Compute message flow metrics
├── templates/
│ ├── dashboard.html # Main dashboard template
│ ├── performance.html # Performance view template
│ ├── backends.html # Backend health template
│ └── partials/
│ ├── pattern_grid.html # Reusable pattern grid
│ └── metrics_chart.html # Reusable chart component
├── static/
│ ├── css/
│ │ └── dashboard.css # Custom styles
│ ├── js/
│ │ ├── htmx.min.js # HTMX (14KB)
│ │ ├── d3.v7.min.js # D3.js (70KB)
│ │ ├── mermaid.min.js # Mermaid.js (200KB)
│ │ └── dashboard.js # Custom WebSocket + D3 logic
│ └── assets/
│ └── prism-logo.svg # Static assets
└── embed.go # Embed templates/static with go:embed
```
Go Backend Example
See ADR-061 for complete implementation with all collectors and aggregators.
```go
// dashboard/main.go
package main

import (
    "embed"
    "html/template"
    "log"
    "net/http"
    "time"

    "github.com/gorilla/mux"
    "github.com/gorilla/websocket"
)

//go:embed templates/* static/*
var content embed.FS

var (
    templates *template.Template
    upgrader  = websocket.Upgrader{
        ReadBufferSize:  1024,
        WriteBufferSize: 1024,
    }
)

func init() {
    templates = template.Must(template.ParseFS(content, "templates/*.html", "templates/partials/*.html"))
}

func main() {
    r := mux.NewRouter()

    // Serve static files (embedded)
    r.PathPrefix("/static/").Handler(http.FileServer(http.FS(content)))

    // Page routes (render Go templates)
    r.HandleFunc("/", dashboardHandler).Methods("GET")
    r.HandleFunc("/performance", performanceHandler).Methods("GET")
    r.HandleFunc("/backends", backendsHandler).Methods("GET")

    // API routes (JSON/HTML fragments for HTMX)
    r.HandleFunc("/api/health", apiHealthHandler).Methods("GET")
    r.HandleFunc("/api/patterns", apiPatternsHandler).Methods("GET")
    r.HandleFunc("/api/performance", apiPerformanceHandler).Methods("GET")

    // WebSocket for real-time updates
    r.HandleFunc("/ws", wsHandler)

    // Start background data collector
    go startCollector()

    log.Println("Dashboard starting on :8095")
    log.Fatal(http.ListenAndServe(":8095", r))
}

func dashboardHandler(w http.ResponseWriter, r *http.Request) {
    data := collectSystemHealth()
    templates.ExecuteTemplate(w, "dashboard.html", data)
}

func apiHealthHandler(w http.ResponseWriter, r *http.Request) {
    // Return HTML fragment for HTMX partial replacement
    data := collectSystemHealth()
    templates.ExecuteTemplate(w, "pattern_grid.html", data)
}

func wsHandler(w http.ResponseWriter, r *http.Request) {
    conn, err := upgrader.Upgrade(w, r, nil)
    if err != nil {
        log.Println("WebSocket upgrade error:", err)
        return
    }
    defer conn.Close()

    ticker := time.NewTicker(2 * time.Second)
    defer ticker.Stop()

    for range ticker.C {
        data := collectSystemHealth()
        if err := conn.WriteJSON(data); err != nil {
            return
        }
    }
}
```
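
`startCollector` and `collectSystemHealth` are referenced above but not defined in this RFC. A minimal sketch, assuming the collector refreshes a shared snapshot on the 5-second cadence from the Data Flow section while handlers read the warm cache (`aggregateFromSources` is a hypothetical helper standing in for the Prometheus, gRPC health, and SigNoz collectors):

```go
// dashboard/collector.go (sketch, same package as main.go)

import "sync"

var (
    healthMu    sync.RWMutex
    healthCache SystemHealthData
)

// startCollector refreshes the shared snapshot every 5 seconds so that page,
// API, and WebSocket handlers never fan out to data sources per request.
func startCollector() {
    ticker := time.NewTicker(5 * time.Second)
    defer ticker.Stop()
    for range ticker.C {
        data := aggregateFromSources() // Prometheus scrape + gRPC health + SigNoz query
        healthMu.Lock()
        healthCache = data
        healthMu.Unlock()
    }
}

// collectSystemHealth returns the latest snapshot; handlers call this.
func collectSystemHealth() SystemHealthData {
    healthMu.RLock()
    defer healthMu.RUnlock()
    return healthCache
}
```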
Go Template Example (Server-Rendered HTML)
```html
<!-- dashboard/templates/dashboard.html -->
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Prism Operations Dashboard</title>
  <link rel="stylesheet" href="/static/css/dashboard.css">
  <script src="/static/js/htmx.min.js"></script>
  <script src="/static/js/d3.v7.min.js"></script>
</head>
<body>
  <header>
    <h1>Prism Operations Dashboard</h1>
    <div class="refresh-indicator" id="last-update">Last updated: {{.Timestamp.Format "15:04:05"}}</div>
  </header>

  <main>
    <!-- System Status Banner -->
    <section class="status-banner status-{{.OverallStatus}}">
      <div class="status-icon">{{if eq .OverallStatus "HEALTHY"}}🟢{{else if eq .OverallStatus "DEGRADED"}}🟡{{else}}🔴{{end}}</div>
      <div class="status-text">SYSTEM {{.OverallStatus}}</div>
      <div class="metrics-summary">
        <span class="metric">✅ {{printf "%.2f" .SuccessRate}}% Success Rate</span>
        <span class="metric">📊 {{.RPS}} RPS</span>
        <span class="metric">⚡ {{printf "%.1f" .LatencyP99}}ms P99</span>
      </div>
    </section>

    <!-- Pattern Health Grid (Auto-refresh with HTMX) -->
    <section class="pattern-health">
      <h2>Pattern Health</h2>
      <div hx-get="/api/patterns" hx-trigger="load, every 5s" hx-swap="innerHTML">
        {{template "pattern_grid.html" .}}
      </div>
    </section>

    <!-- Critical Metrics Charts (D3.js) -->
    <section class="metrics-charts">
      <h2>Critical Metrics (Last 5 minutes)</h2>
      <div class="chart-grid">
        <div id="latency-chart" class="chart"></div>
        <div id="throughput-chart" class="chart"></div>
        <div id="error-chart" class="chart"></div>
      </div>
    </section>

    <!-- Recent Alerts -->
    <section class="alerts">
      <h2>Recent Alerts</h2>
      <ul class="alert-list" hx-get="/api/alerts" hx-trigger="load, every 10s" hx-swap="innerHTML">
        {{range .RecentAlerts}}
        <li class="alert-{{.Severity}}">
          <span class="alert-time">{{.Timestamp.Format "15:04"}}</span>
          {{.Message}}
        </li>
        {{end}}
      </ul>
    </section>
  </main>

  <script src="/static/js/dashboard.js"></script>
  <script>
    // Render D3.js charts on load
    renderLatencyChart({{.LatencyHistory}});
    renderThroughputChart({{.ThroughputHistory}});
    renderErrorChart({{.ErrorHistory}});

    // WebSocket for real-time updates
    const ws = new WebSocket('ws://' + location.host + '/ws');
    ws.onmessage = (event) => {
      const data = JSON.parse(event.data);
      updateCharts(data);
      document.getElementById('last-update').textContent = 'Last updated: ' + new Date().toLocaleTimeString();
    };
  </script>
</body>
</html>
```
Data Flow Sequence
Deployment
Local Development
```bash
# Start dashboard (single command, instant reload with air)
cd dashboard
go run main.go

# OR with live reload
air

# Dashboard available at http://localhost:8095
# No build step required!
```
Production Binary
```bash
# Build single binary with embedded assets
cd dashboard
go build -o ../bin/prism-dashboard main.go

# Binary size: ~15MB (includes templates + static assets)
# Run anywhere (no dependencies)
./bin/prism-dashboard
```
Docker Compose (Optional)
```yaml
# docker-compose.dashboard.yml
version: '3.8'

services:
  dashboard:
    build:
      context: ./dashboard
      dockerfile: Dockerfile
    container_name: prism-dashboard
    ports:
      - "8095:8095"
    environment:
      - PROXY_METRICS_URL=http://prism-proxy:8980/metrics
      - SIGNOZ_API_URL=http://signoz-query:8080
      - ADMIN_API_URL=http://prism-admin:8090
    networks:
      - prism

networks:
  prism:
    external: true
```
```dockerfile
# dashboard/Dockerfile
FROM golang:1.21-alpine AS builder
WORKDIR /build
COPY . .
RUN go build -o prism-dashboard main.go

FROM alpine:latest
COPY --from=builder /build/prism-dashboard /usr/local/bin/
EXPOSE 8095
CMD ["prism-dashboard"]
```
Makefile Targets
```makefile
# Makefile
.PHONY: dashboard-run dashboard-build dashboard-dev dashboard-up dashboard-down dashboard-assets

dashboard-run:
	@echo "Starting Prism Dashboard (Go)..."
	cd dashboard && go run main.go

dashboard-build:
	@echo "Building Dashboard binary..."
	cd dashboard && go build -o ../bin/prism-dashboard main.go
	@echo "Binary: bin/prism-dashboard (size: $(shell du -h bin/prism-dashboard | cut -f1))"

dashboard-dev:
	@echo "Starting Dashboard with live reload..."
	cd dashboard && air

# Install air: go install github.com/cosmtrek/air@latest

dashboard-up:
	@echo "Starting Prism Operations Dashboard (Docker)..."
	docker-compose -f docker-compose.dashboard.yml up -d
	@echo "Dashboard: http://localhost:8095"

dashboard-down:
	docker-compose -f docker-compose.dashboard.yml down

dashboard-assets:
	@echo "Downloading frontend assets..."
	mkdir -p dashboard/static/js
	curl -o dashboard/static/js/htmx.min.js https://unpkg.com/htmx.org@1.9.10/dist/htmx.min.js
	curl -o dashboard/static/js/d3.v7.min.js https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js
	curl -o dashboard/static/js/mermaid.min.js https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.min.js
```
Alternatives Considered
Alternative 1: Grafana Dashboards
Pros:
- Industry standard, mature
- Excellent charting library
- Integrates with Prometheus/SigNoz
- Alerting built-in
Cons:
- Generic (not tailored to Prism's specific patterns)
- Requires learning Grafana query language
- Less real-time feel (polling-based)
- No custom interaction (e.g., click pattern to drill down)
Rejected because: Custom dashboard provides better UX for Prism-specific workflows
Alternative 2: SigNoz-Only Approach
Pros:
- Already using SigNoz
- No additional service to maintain
- Built-in trace/metric visualization
Cons:
- Not optimized for operational health monitoring
- No pattern-specific views
- Can't aggregate across multiple data sources (SigNoz + Admin API + pattern health checks)
- No custom alerts/thresholds
Rejected because: SigNoz is for debugging/analysis, not operational monitoring
Alternative 3: Prometheus + AlertManager Only
Pros:
- Simple, battle-tested
- Low resource footprint
- Alert-focused
Cons:
- No visual dashboard (alerts via notifications only)
- Requires configuring complex PromQL queries
- No real-time drill-down
- Limited context (just metrics, no traces)
Rejected because: Alerts are reactive; dashboard is proactive
Alternative 4: Framework-Based (FastAPI + React)
Pros:
- Rich React component ecosystem
- FastAPI modern Python framework
- TypeScript type safety
- Popular stack, easy to hire for
Cons:
- ❌ Build complexity: npm install (5min), webpack build (10-60s)
- ❌ Dependency hell: 500+ npm packages
- ❌ Slow iteration: Change → rebuild → reload (10s+)
- ❌ Large bundles: 2-5MB JavaScript
- ❌ Framework churn: React/Vue/Svelte versions change
- ❌ Language mismatch: Python + JavaScript = context switching
- ❌ Debugging complexity: Source maps, transpilation issues
Rejected because: Framework-less approach (Go + HTMX + D3.js) provides:
- Instant reload (no build step)
- 300KB bundle vs 2-5MB
- Single language (Go for backend + templates)
- Simpler deployment (single binary vs Python + Node.js)
- Aligned with project philosophy (ADR-061)
See ADR-061: Framework-Less Web UI for complete rationale.
Success Metrics
Dashboard Effectiveness
| Metric | Target | Measurement |
|---|---|---|
| MTTR Reduction | 50% faster | Time to identify root cause |
| Issue Detection Before User Reports | 95% | Issues caught by dashboard vs user tickets |
| Developer Adoption | 80% daily usage | Unique dashboard users per day |
| Alert Noise Reduction | 30% fewer false alerts | Alert count before/after dashboard |
| Dashboard Performance | <2s end-to-end latency | Time from metric change to UI update |
Technical Metrics
| Metric | Target |
|---|---|
| WebSocket Uptime | >99.9% |
| API Response Time | P99 <100ms |
| Frontend Load Time | <2s initial load |
| Memory Footprint | <256MB backend, <50MB frontend |
| Data Freshness | <5s lag from source |
Implementation Phases
Phase 1: MVP (Week 1-2)
- View 1: System Health (status banner, pattern grid)
- Backend: Go HTTP server with WebSocket
- Frontend: Go templates + HTMX + D3.js
- Data collectors: Prometheus scraper, gRPC health client
- Deliverable: Working dashboard at `localhost:8095`, instant reload
Phase 2: Performance & Backend Views (Week 3)
- View 2: Performance Monitoring (latency, throughput, SLO)
- View 3: Backend Health (connection pools)
- Integrate SigNoz API queries
- D3.js charts for latency/throughput visualization
Phase 3: Messaging & Multi-Tenancy (Week 4)
- View 4: Messaging Flow (PubSub metrics)
- View 5: Multi-Tenancy (namespace breakdown)
- HTMX partial updates for pattern grid
- Mermaid.js diagrams for message flow
Phase 4: Security & Observability (Week 5)
- View 6: Security Monitoring
- View 7: Observability Health
- View 8: System Resources
- Advanced D3.js visualizations (heatmaps, pie charts)
Phase 5: Polish & Production (Week 6)
- Alert history and notification integration
- Dark mode support (CSS variables)
- Mobile-responsive layout (Tailwind optional)
- Single binary build with embedded assets
- Documentation and Makefile targets
Open Questions
1. Should alerts be embedded in dashboard or use separate AlertManager?
   - Proposal: Dashboard shows alerts but doesn't manage them (SigNoz AlertManager owns alert logic)
   - Reasoning: Separation of concerns; dashboard is read-only view

2. How to handle multi-proxy deployments (future)?
   - Proposal: Add proxy selector dropdown in UI, query metrics per proxy
   - Reasoning: Defer until multi-proxy is required (post-POC)

3. Should dashboard persist historical data or rely on SigNoz?
   - Proposal: No persistence; query SigNoz for historical views
   - Reasoning: Avoid data duplication, SigNoz is source of truth

4. Access control for dashboard?
   - Proposal: Phase 1 = no auth (local dev), Phase 2 = integrate with Dex OIDC
   - Reasoning: Focus on functionality first, add auth for production

5. Mobile app or web-only?
   - Proposal: Web-only, responsive design for mobile browsers
   - Reasoning: Native mobile app is significant scope expansion
Related Documents
- RFC-016: Local Development Infrastructure - SigNoz/Dex setup
- RFC-018: POC Implementation Strategy - Implementation context
- ADR-028: Admin UI (FastAPI + Ember) - Admin UI architecture (different from HUD)
- ADR-048: Local SigNoz Observability - SigNoz integration
- RFC-008: Proxy Plugin Architecture - Metrics endpoints
References
Dashboard Design Patterns
- Google SRE: Monitoring Distributed Systems
- DataDog Dashboard Best Practices
- Grafana Dashboard Design
Revision History
- 2025-11-07: Initial draft with 8 prioritized views and technical architecture