RFC-050: Operations Dashboard (HUD)
Status: Draft | Author: Platform Team | Created: 2025-11-07 | Updated: 2025-11-07
Abstract
This RFC defines the Prism Operations Dashboard (internally called "HUD" - Heads-Up Display), a real-time web interface providing comprehensive visibility into Prism proxy health, pattern status, backend connectivity, and system performance.
Core Principle: Detect issues before users notice through proactive monitoring of high-probability failure indicators.
The dashboard provides:
- Real-time System Health: Overall status with drill-down to component details
- Performance Monitoring: Latency, throughput, and SLO compliance tracking
- Pattern & Backend Health: Lifecycle state and connection pool monitoring
- Messaging Visibility: PubSub message flow and delivery health
- Multi-Tenancy Insights: Per-namespace resource usage and error rates
- Security Monitoring: Authentication success rates and authorization metrics
- Observability Health: OpenTelemetry pipeline status
- Resource Utilization: System-level CPU, memory, and capacity tracking
Technology Stack: Go HTTP server (backend), HTMX + D3.js (frontend), WebSocket for real-time updates, Prometheus metrics + SigNoz traces. Framework-less approach with no build step required (see ADR-061).
Motivation
Problem Statement
Current Situation (Post-POC 1-3):
- ✅ Proxy spawns and manages patterns successfully
- ✅ Patterns communicate with backends (MemStore, Redis, NATS)
- ✅ OpenTelemetry traces sent to SigNoz
- ❌ No unified operational view: Developers must check:
  - SigNoz UI for traces
  - Docker logs for pattern health
  - `curl` requests for metrics
  - Manual gRPC health checks
  - Backend-specific tools (redis-cli, nats-cli)
Pain Points:
- Reactive Debugging: Issues discovered after failures occur
- Scattered Information: No single pane of glass for system health
- Slow MTTR: Mean Time To Resolution is high due to information-gathering overhead
- No SLO Visibility: Can't track whether latency/reliability targets are being met
- Pattern Health Blind Spots: Process crashes/restarts go unnoticed
- Backend Issues Surface Late: Connection pool exhaustion detected only after timeouts
Goals
- Proactive Issue Detection: Catch problems before user impact (target: 95% issues detected within 30s)
- Single Pane of Glass: All critical system health in one view
- Fast MTTR: Reduce time to identify root cause from minutes to seconds
- SLO Tracking: Visualize compliance with latency/reliability targets
- Developer Productivity: Eliminate manual metric gathering during debugging
- Production Readiness: Dashboard suitable for both local dev and production
Non-Goals
- Application Metrics: Dashboard focuses on Prism infrastructure, not application-level business metrics
- Log Aggregation: Logs remain in SigNoz; dashboard shows summary insights only
- Alerting Engine: Dashboard surfaces metrics but doesn't replace alert manager (use SigNoz alerts)
- Historical Analysis: Focus on real-time (last 24h), not long-term trends (use SigNoz for that)
- Multi-Cluster Management: Single cluster/proxy instance view (multi-cluster is future work)
Architecture Overview
System Context
Data Flow
- Proxy & Patterns: Emit metrics (Prometheus format) + traces (OTLP)
- Dashboard Backend: Aggregates data from multiple sources every 5 seconds
- WebSocket Push: Real-time updates to browser clients
- Frontend Rendering: HTMX + D3.js display live data with <2s latency
Technology Stack
See ADR-061: Framework-Less Web UI for complete rationale.
Backend (dashboard/):
- Language: Go
- HTTP Server: `net/http` standard library + `gorilla/mux`
- Templates: Go `html/template` (server-side rendering)
- Metrics Scraping: Native Prometheus text format parser
- gRPC Client: Native Go gRPC client (call pattern health checks)
- SigNoz Integration: `net/http` client (query SigNoz API)
- WebSocket: `gorilla/websocket` (push updates to frontend)
- Caching: Optional in-memory cache (reduce SigNoz query load)
Frontend (embedded in dashboard/static/ and dashboard/templates/):
- HTML: Go templates (server-rendered, type-safe)
- Interactivity: HTMX 1.9+ (14KB, replaces React)
- Visualization: D3.js v7 (~70KB, best-in-class charting)
- Diagrams: Mermaid.js (~200KB, text-to-diagram)
- Styling: Plain CSS or Tailwind (optional)
- WebSocket Client: Native browser WebSocket API
- Build Step: NONE - no npm, no webpack, instant reload
Deployment:
- Binary: Single Go executable (`go build`)
- Assets: Embedded with `//go:embed` directive (templates + static files)
- Development: `make dashboard-run` starts server (instant reload with `air`)
- Production: Single binary or Docker container (no Node.js runtime needed)
Dashboard Views
View Hierarchy
The dashboard uses a hub-and-spoke model with a primary System Health view and drill-down panels:
```text
┌─────────────────────────────────────────────────────────┐
│ System Health (Hub) │ ← Always visible
│ Overall status, critical metrics, pattern grid │
├─────────────────────────────────────────────────────────┤
│ Drill-Down Panels (Spoke) │ ← Accessed via tabs/clicks
│ ┌──────────────┬──────────────┬──────────────┐ │
│ │ Performance │ Backend │ Messaging │ │
│ │ Monitoring │ Health │ Flow │ │
│ ├──────────────┼──────────────┼──────────────┤ │
│ │ Multi- │ Security │ Observability│ │
│ │ Tenancy │ Monitoring │ Health │ │
│ ├──────────────┼──────────────┼──────────────┤ │
│ │ System │ Alerts │ Settings │ │
│ │ Resources │ History │ │ │
│ └──────────────┴──────────────┴──────────────┘ │
└─────────────────────────────────────────────────────────┘
```
View 1: System Health (Primary View)
Priority: CRITICAL (Always visible) | Detection Probability: 99% | Update Frequency: 2 seconds
Purpose: Single-glance answer to "Is Prism working?"
Layout
```text
┌────────────────────────────────────────────────────────────────┐
│ Prism Operations Dashboard [Auto-refresh: 2s]│
├────────────────────────────────────────────────────────────────┤
│ │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ 🟢 SYSTEM HEALTHY ││
│ │ ││
│ │ ✅ 99.97% Success Rate 📊 8,432 RPS ⚡ 0.8ms P99 ││
│ │ 🔒 99.8% Auth Success 🔗 3 Patterns 📦 15 Namespaces ││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ Pattern Health Grid │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Pattern Status Phase Uptime Restarts │ │
│ ├──────────────────────────────────────────────────────────┤ │
│ │ MemStore 🟢 HEALTHY Running 4d 3h 0 │ │
│ │ Redis 🟢 HEALTHY Running 4d 3h 0 │ │
│ │ NATS 🟡 DEGRADED Running 2h 15m 2 │◄─┐
│ │ PostgreSQL 🟢 HEALTHY Running 4d 3h 0 │ │
│ └──────────────────────────────────────────────────────────┘ │
│ [Click row for details] ──┘
│ │
├────────────────────────────────────────────────────────────────┤
│ Critical Metrics (Last 5 minutes) │
│ ┌────────────┬────────────┬────────────┬────────────────────┐ │
│ │ Latency │ Throughput │ Error Rate │ Backend Pools │ │
│ ├────────────┼────────────┼────────────┼────────────────────┤ │
│ │ P50: 0.4ms │ Read: 5.2k│ 0.03% │ Redis: 7/10 🟢 │ │
│ │ P99: 0.8ms │ Write: 3.2k│ 25 errors │ NATS: Connected 🟡 │ │
│ │ P999: 2.1ms│ Total: 8.4k│ /min │ PG: 3/20 🟢 │ │
│ │ │ │ │ │ │
│ │ [Chart] │ [Chart] │ [Chart] │ [Status Grid] │ │
│ └────────────┴────────────┴────────────┴────────────────────┘ │
│ │
├────────────────────────────────────────────────────────────────┤
│ Recent Alerts │
│ 🟡 2m ago: NATS pattern restarted (restart loop detected) │
│ 🟢 15m ago: Redis pool capacity >90% (now resolved) │
│ [View All Alerts →] │
└────────────────────────────────────────────────────────────────┘
```
Components
1. System Status Banner
- Overall Health: Computed from all patterns (GREEN if all healthy, YELLOW if any degraded, RED if any unhealthy)
- Success Rate: `(successful_requests / total_requests) * 100` (last 5 min)
- RPS: Current requests per second (read + write)
- P99 Latency: 99th percentile latency (last 5 min)
- Auth Success: Authentication success rate
- Active Patterns: Count of patterns in HEALTHY or DEGRADED state
- Namespaces: Total active namespaces
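
The Overall Health roll-up described above is a worst-state-wins fold over the pattern list. A minimal sketch, assuming the gRPC health collector exposes a struct like `PatternHealth` (the type and field names are illustrative):

```go
// dashboard/aggregators/system_health.go (sketch)
package aggregators

// PatternHealth is the per-pattern view assumed by this sketch; the real
// struct comes from the gRPC health collector.
type PatternHealth struct {
    Name     string
    Status   string // "HEALTHY" | "DEGRADED" | "UNHEALTHY"
    Phase    string // spawn → connect → initialize → start → running
    Restarts int    // restarts in the last 24h
}

// computeOverallHealth is worst-state-wins: RED if any pattern is unhealthy,
// YELLOW if any is degraded, GREEN otherwise.
func computeOverallHealth(patterns []PatternHealth) string {
    overall := "HEALTHY"
    for _, p := range patterns {
        switch p.Status {
        case "UNHEALTHY":
            return "UNHEALTHY" // one red pattern makes the banner red
        case "DEGRADED":
            overall = "DEGRADED" // remember yellow, keep scanning for red
        }
    }
    return overall
}
```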
2. Pattern Health Grid
- Status: 🟢 HEALTHY / 🟡 DEGRADED / 🔴 UNHEALTHY (from gRPC HealthCheck)
- Phase: spawn → connect → initialize → start → running
- Uptime: Time since last start
- Restarts: Restart count in last 24h
- Click Action: Drill down to pattern detail view
3. Critical Metrics Charts
- Latency Chart: Line chart showing P50/P99/P999 over last 5 minutes
- Throughput Chart: Stacked area chart (read vs write RPS)
- Error Rate Chart: Bar chart of errors per minute
- Backend Pools: Connection pool status for each backend
4. Recent Alerts
- Last 5 alerts/warnings with timestamp and auto-resolution status
- Color-coded by severity: 🔴 Critical / 🟡 Warning / 🟢 Resolved
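
The restart-loop warning shown above can be derived directly from the pattern grid data. A hedged sketch of one such rule, reusing the illustrative `PatternHealth` type from the previous sketch (the two-restart threshold is a placeholder, not an agreed value):

```go
// dashboard/aggregators/alerts.go (sketch)
package aggregators

import (
    "fmt"
    "time"
)

// Alert feeds the Recent Alerts list; Severity drives the color coding.
type Alert struct {
    Severity  string // "critical" | "warning" | "resolved"
    Message   string
    Timestamp time.Time
}

// evaluatePatternAlerts flags restart loops: two or more restarts inside the
// 24h window produce a warning. The threshold is illustrative.
func evaluatePatternAlerts(patterns []PatternHealth) []Alert {
    var alerts []Alert
    for _, p := range patterns {
        if p.Restarts >= 2 {
            alerts = append(alerts, Alert{
                Severity:  "warning",
                Message:   fmt.Sprintf("%s pattern restarted (restart loop detected)", p.Name),
                Timestamp: time.Now(),
            })
        }
    }
    return alerts
}
```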
Data Sources
```go
// dashboard/handlers/system_health.go
func SystemHealthHandler(w http.ResponseWriter, r *http.Request) {
    // Aggregate system health from all data sources.

    // 1. Scrape proxy metrics (Prometheus)
    proxyMetrics, err := scrapePrometheus("http://localhost:8980/metrics")
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }

    // 2. Query pattern health checks (gRPC)
    patternHealth, err := queryAllPatternsHealth(r.Context())
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }

    // 3. Query SigNoz for latency percentiles
    latency, err := querySigNozLatency("5m")
    if err != nil {
        log.Printf("SigNoz query error: %v", err)
        // Continue with empty latency data
    }

    // 4. Query admin API for namespace count
    namespaces, err := adminClient.ListNamespaces(r.Context())
    if err != nil {
        log.Printf("Admin API error: %v", err)
    }

    data := SystemHealthData{
        OverallStatus:   computeOverallHealth(patternHealth),
        SuccessRate:     proxyMetrics.SuccessRate,
        RPS:             proxyMetrics.RequestsPerSecond,
        LatencyP99:      latency.P99,
        AuthSuccessRate: proxyMetrics.AuthSuccessRate,
        Patterns:        patternHealth,
        NamespaceCount:  len(namespaces),
        Timestamp:       time.Now(),
    }

    // Render the Go template; log (don't crash) on template errors.
    if err := templates.ExecuteTemplate(w, "dashboard.html", data); err != nil {
        log.Printf("template error: %v", err)
    }
}
```
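
The `scrapePrometheus` call above is the "native Prometheus text format parser" from the Technology Stack. A minimal sketch of that collector, assuming the standard text exposition format (the `Sample` type and function names are illustrative, not settled API):

```go
// dashboard/collectors/prometheus.go (sketch)
package collectors

import (
    "bufio"
    "fmt"
    "net/http"
    "strconv"
    "strings"
)

// Sample is one parsed metric line, e.g. prism_requests_total{namespace="x"} 42.
type Sample struct {
    Name   string
    Labels string // raw label block, e.g. namespace="x",operation="get"
    Value  float64
}

// ScrapePrometheus fetches a /metrics endpoint and parses the text exposition
// format. Only the common `name{labels} value` shape is handled; exemplars and
// escaped label values are out of scope for this sketch.
func ScrapePrometheus(url string) ([]Sample, error) {
    resp, err := http.Get(url)
    if err != nil {
        return nil, fmt.Errorf("scrape %s: %w", url, err)
    }
    defer resp.Body.Close()

    var samples []Sample
    scanner := bufio.NewScanner(resp.Body)
    for scanner.Scan() {
        line := strings.TrimSpace(scanner.Text())
        if line == "" || strings.HasPrefix(line, "#") {
            continue // skip HELP/TYPE comments and blank lines
        }
        var name, labels, rest string
        if open := strings.IndexByte(line, '{'); open >= 0 {
            end := strings.IndexByte(line, '}')
            if end < open {
                continue // malformed label block
            }
            name, labels = line[:open], line[open+1:end]
            rest = line[end+1:]
        } else {
            parts := strings.SplitN(line, " ", 2)
            if len(parts) != 2 {
                continue
            }
            name, rest = parts[0], parts[1]
        }
        fields := strings.Fields(rest) // value, optional timestamp
        if len(fields) == 0 {
            continue
        }
        value, err := strconv.ParseFloat(fields[0], 64)
        if err != nil {
            continue
        }
        samples = append(samples, Sample{Name: name, Labels: labels, Value: value})
    }
    return samples, scanner.Err()
}
```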
WebSocket Push
Client-side (in Go template):
```html
<!-- dashboard/templates/dashboard.html -->
<script>
  // Native WebSocket API (no React hooks needed)
  const ws = new WebSocket('ws://' + location.host + '/ws/system-health');

  ws.onmessage = (event) => {
    const data = JSON.parse(event.data);

    // Update status banner
    document.getElementById('success-rate').textContent = data.success_rate.toFixed(2) + '%';
    document.getElementById('rps').textContent = data.rps;
    document.getElementById('latency-p99').textContent = data.latency_p99.toFixed(1) + 'ms';

    // Update charts with D3.js
    updateLatencyChart(data.latency_history);
    updateThroughputChart(data.throughput_history);

    // Update last refresh time
    document.getElementById('last-update').textContent = new Date().toLocaleTimeString();
  };

  ws.onerror = (error) => {
    console.error('WebSocket error:', error);
    document.getElementById('status-indicator').className = 'status-error';
  };
</script>
```
View 2: Performance Monitoring
Priority: HIGH | Detection Probability: 95% | Update Frequency: 5 seconds
Purpose: Detailed latency, throughput, and SLO compliance tracking
Layout
```text
┌────────────────────────────────────────────────────────────────┐
│ Performance Monitoring │
├────────────────────────────────────────────────────────────────┤
│ │
│ SLO Compliance │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Target: 99.9% requests < 10ms (P99) ││
│ │ Current: 99.7% ✅ ││
│ │ [━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━] 99.7% ││
│ │ ││
│ │ Last 24h: 99.8% ✅ | Last 7d: 99.9% ✅ ││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ Latency Breakdown (Last 1 hour) │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Operation P50 P99 P999 Max Count ││
│ ├────────────────────────────────────────────────────────────┤│
│ │ KeyValue.Set 0.3ms 0.7ms 1.2ms 4.5ms 125k ││
│ │ KeyValue.Get 0.2ms 0.5ms 0.9ms 3.1ms 287k ││
│ │ PubSub.Pub 0.4ms 0.9ms 2.1ms 8.7ms 42k ││
│ │ PubSub.Sub 0.8ms 2.3ms 5.4ms 15.2ms 18k ││
│ └────────────────────────────────────────────────────────────┘│
│ │
│ Latency Heatmap (Time vs Percentile) │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ 10ms ┤ ▂▃ ││
│ │ 5ms ┤ ▁▂▃▄▅▆▇███ ││
│ │ 1ms ┤ ▁▂▃▄▅▆▇███ ││
│ │ 0.5ms┤ ▁▂▃▄▅▆▇████████ ││
│ │ 0.1ms┤▁▂▃▄▅▆▇█████ ││
│ │ └────────────────────────────────────────────────────┘││
│ │ 12:00 12:15 12:30 12:45 13:00 13:15 13:30 ││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ Throughput by Pattern │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ [Stacked Area Chart] ││
│ │ 10k ┤ ███ ││
│ │ 8k ┤ ▄▄▄▄▄███ ││
│ │ 6k ┤ ▃▃▃▃▃█████████ ││
│ │ 4k ┤ ▂▂▂▂▂▂██████████████ ││
│ │ 2k ┤ ▁▁▁▁▁▁▁▁▁▁████████████████████ ││
│ │ └────────────────────────────────────────────────────┘││
│ │ MemStore ▀▀▀ Redis ▀▀▀ NATS ▀▀▀ PostgreSQL ▀▀▀ ││
│ └────────────────────────────────────────────────────────────┘│
│ │
└────────────────────────────────────────────────────────────────┘
```
Key Features
- SLO Compliance Tracker: Visual progress bar showing % of requests meeting latency target
- Operation-Level Latency: Breakdown by operation type (Set, Get, Publish, Subscribe)
- Latency Heatmap: Visualize latency distribution over time (identify spikes)
- Throughput by Pattern: See which patterns are handling most traffic
Data Sources
- Latency Percentiles: Query SigNoz traces aggregated by operation
- SLO Compliance: Count requests with latency < 10ms / total requests
- Throughput: Scrape Prometheus metrics from proxy (`prism_requests_total` counter)
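
The SLO arithmetic itself is a ratio of cumulative histogram counters. A sketch, assuming the proxy exports a standard Prometheus histogram (metric names such as `prism_request_duration_seconds_bucket` are assumptions):

```go
// dashboard/aggregators/performance.go (sketch)
package aggregators

// SLOCompliance computes the percentage of requests meeting the latency
// target from cumulative histogram counters. With a 10ms target this is
// bucket{le="0.01"} / count * 100, since Prometheus buckets are cumulative.
func SLOCompliance(bucketUnderTarget, totalCount float64) float64 {
    if totalCount == 0 {
        return 100.0 // no traffic: report compliant rather than divide by zero
    }
    return bucketUnderTarget / totalCount * 100.0
}
```

A reading of 99.7% in the compliance bar above means the under-10ms bucket held 0.997 of the window's request count.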
View 3: Backend Health
Priority: HIGH | Detection Probability: 92% | Update Frequency: 5 seconds
Purpose: Connection pool health and backend connectivity status
Layout
```text
┌────────────────────────────────────────────────────────────────┐
│ Backend Health │
├────────────────────────────────────────────────────────────────┤
│ │
│ Connection Pools │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Backend Type Active Idle Max Util Status ││
│ ├────────────────────────────────────────────────────────────┤│
│ │ Redis KeyValue 7 3 10 70% 🟢 HEALTHY ││
│ │ PubSub 2 8 10 20% 🟢 HEALTHY ││
│ │ NATS PubSub 1 0 1 100% 🟡 DEGRADED ││
│ │ PostgreSQL KeyValue 3 17 20 15% 🟢 HEALTHY ││
│ │ Queue 5 15 20 25% 🟢 HEALTHY ││
│ │ MemStore KeyValue N/A N/A N/A N/A 🟢 HEALTHY ││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ Connection Metrics (Last 5 minutes) │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Redis ││
│ │ ├─ Connections: [Time series chart] ││
│ │ ├─ Acquisition Time: Avg 2.1ms P99 8.3ms ││
│ │ ├─ Errors: 0 refused, 0 timeout, 0 reset ││
│ │ └─ Pool Capacity: [Progress bar] 70% ││
│ │ ││
│ │ NATS ││
│ │ ├─ Connection State: CONNECTED (reconnects: 2) ││
│ │ ├─ Subscriptions: 127 active ││
│ │ ├─ Stats: In: 42k msgs (8.4 MB) Out: 18k msgs (1.2 MB) ││
│ │ └─ Pending: 0 messages ││
│ │ ││
│ │ PostgreSQL ││
│ │ ├─ Connections: 8/20 active ││
│ │ ├─ Active Queries: 3 ││
│ │ ├─ Query Duration: Avg 12ms P99 45ms ││
│ │ └─ Pool Wait Time: Avg 0.3ms ││
│ └────────────────────────────────────────────────────────────┘│
│ │
└────────────────────────────────────────────────────────────────┘
```
Key Features
- Connection Pool Table: Shows active/idle/max connections per backend
- Utilization Tracking: Visual indicator when approaching capacity (>90% = yellow)
- Connection Acquisition Time: How long to get a connection from pool
- Error Tracking: Connection refused, timeout, reset counts
- Backend-Specific Metrics:
  - Redis: Pool stats from `PoolStats()`
  - NATS: Connection state, subscription count, message stats
  - PostgreSQL: Active queries, query duration
Data Sources
```go
// patterns/redis/plugin.go
func (r *RedisPlugin) HealthCheck(ctx context.Context) *HealthCheckResponse {
    stats := r.client.PoolStats()
    return &HealthCheckResponse{
        Status: computeStatus(stats),
        Metadata: map[string]string{
            "connections_active": fmt.Sprintf("%d", stats.TotalConns-stats.IdleConns),
            "connections_idle":   fmt.Sprintf("%d", stats.IdleConns),
            "pool_size":          fmt.Sprintf("%d", r.config.PoolSize),
            "utilization_pct":    fmt.Sprintf("%.1f", float64(stats.TotalConns-stats.IdleConns)/float64(r.config.PoolSize)*100),
        },
    }
}
```
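
`computeStatus` is referenced above but not defined in this RFC. One plausible mapping, using the >90% utilization threshold from the Key Features list (this variant takes the configured pool size as a parameter, a small departure from the call site above, and assumes the go-redis client):

```go
// patterns/redis/plugin.go (sketch of the status mapping)

import "github.com/redis/go-redis/v9" // assumption: adjust to the plugin's actual client

type HealthStatus int

const (
    HEALTHY HealthStatus = iota
    DEGRADED
    UNHEALTHY
)

// computeStatus maps connection-pool pressure to a health status: >90%
// utilization turns the row yellow, full exhaustion turns it red.
func computeStatus(stats *redis.PoolStats, poolSize int) HealthStatus {
    if poolSize == 0 {
        return UNHEALTHY
    }
    active := int(stats.TotalConns) - int(stats.IdleConns)
    utilization := float64(active) / float64(poolSize)
    switch {
    case utilization >= 1.0:
        return UNHEALTHY // pool exhausted: acquisitions will queue or time out
    case utilization > 0.90:
        return DEGRADED
    default:
        return HEALTHY
    }
}
```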
View 4: Messaging Flow (PubSub)
Priority: MEDIUM-HIGH | Detection Probability: 88% | Update Frequency: 5 seconds
Purpose: PubSub message delivery health and subscriber tracking
Layout
```text
┌────────────────────────────────────────────────────────────────┐
│ Messaging Flow (PubSub) │
├────────────────────────────────────────────────────────────────┤
│ │
│ Message Flow Health │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Published: 42,187 msgs | Delivered: 126,561 msgs ✅ ││
│ │ Dropped: 12 msgs (0.03%) | Pending: 0 msgs ││
│ │ ││
│ │ Delivery Ratio: 3.0x (fanout working correctly) ││
│ │ Delivery Latency: P99 2.3ms ✅ ││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ Active Topics & Subscribers │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Topic Subs Pub/sec Del/sec Latency Status ││
│ ├────────────────────────────────────────────────────────────┤│
│ │ events.user 3 142 426 1.2ms 🟢 ││
│ │ events.system 5 87 435 0.9ms 🟢 ││
│ │ logs.application 1 523 523 0.4ms 🟢 ││
│ │ alerts.critical 12 2 24 15.2ms 🟡 ││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ Message Timeline (Last 5 minutes) │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Published ▀▀▀ ││
│ │ Delivered ▀▀▀ ││
│ │ Dropped ▀▀▀ ││
│ │ ││
│ │ [Multi-line chart showing published vs delivered vs dropped]││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ Subscriber Details │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Subscriber ID Topic Msgs Recv Lag ││
│ ├────────────────────────────────────────────────────────────┤│
│ │ sub-worker-1 events.user 14,235 0ms ││
│ │ sub-worker-2 events.user 14,190 0ms ││
│ │ sub-worker-3 events.user 14,201 0ms ││
│ │ sub-analytics events.system 8,745 0ms ││
│ │ sub-logger logs.application 52,301 125ms 🟡 ││
│ └────────────────────────────────────────────────────────────┘│
│ │
└────────────────────────────────────────────────────────────────┘
```
Key Features
- Message Flow Summary: Published vs Delivered (should be N:1 for fanout)
- Dropped Messages: Count and percentage (should be near zero)
- Delivery Latency: Time from publish to subscriber receive
- Topic Breakdown: Per-topic subscriber count and throughput
- Subscriber Lag: Identify slow subscribers (lag >100ms = warning)
Data Sources
```go
// patterns/nats/plugin.go
func (n *NATSPlugin) HealthCheck(ctx context.Context) *HealthCheckResponse {
    stats := n.conn.Stats()
    return &HealthCheckResponse{
        Status: HEALTHY,
        Metadata: map[string]string{
            "subscription_count": fmt.Sprintf("%d", len(n.subscriptions)),
            "in_msgs":            fmt.Sprintf("%d", stats.InMsgs),
            "out_msgs":           fmt.Sprintf("%d", stats.OutMsgs),
            "in_bytes":           fmt.Sprintf("%d", stats.InBytes),
            "out_bytes":          fmt.Sprintf("%d", stats.OutBytes),
            "dropped_msgs":       fmt.Sprintf("%d", n.droppedMessageCount),
        },
    }
}
```
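
The flow-summary numbers in the banner (Published 42,187 → Delivered 126,561 ≈ 3.0x) reduce to two ratios over counters like those exposed above. A sketch, with an illustrative struct:

```go
// dashboard/aggregators/messaging.go (sketch)
package aggregators

// MessageFlow summarizes PubSub health from pattern counters.
type MessageFlow struct {
    Published, Delivered, Dropped uint64
}

// DeliveryRatio is delivered/published; with fanout it should roughly equal
// the average subscriber count per topic (e.g. 3.0x in the view above).
func (f MessageFlow) DeliveryRatio() float64 {
    if f.Published == 0 {
        return 0
    }
    return float64(f.Delivered) / float64(f.Published)
}

// DropRate should stay near zero; the dashboard flags it when it exceeds a
// small threshold (the exact threshold is left open here).
func (f MessageFlow) DropRate() float64 {
    total := f.Delivered + f.Dropped
    if total == 0 {
        return 0
    }
    return float64(f.Dropped) / float64(total) * 100.0
}
```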
View 5: Multi-Tenancy (Namespaces)
Priority: MEDIUM | Detection Probability: 85% | Update Frequency: 10 seconds
Purpose: Per-namespace resource usage and error tracking
Layout
```text
┌────────────────────────────────────────────────────────────────┐
│ Multi-Tenancy (Namespaces) │
├────────────────────────────────────────────────────────────────┤
│ │
│ Active Namespaces (15 total) │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Namespace RPS Latency Errors Patterns Status ││
│ ├────────────────────────────────────────────────────────────┤│
│ │ user-platform 4,231 0.8ms 0.02% KV, PS 🟢 ││
│ │ payments 1,847 1.2ms 0.01% KV, Q 🟢 ││
│ │ analytics 892 2.3ms 0.05% PS, TS 🟢 ││
│ │ notifications 645 0.9ms 1.2% PS 🟡 ││
│ │ search-index 387 5.4ms 0.03% KV, G 🟢 ││
│ │ ... (10 more) ... ... ... ... ... ││
│ └────────────────────────────────────────────────────────────┘│
│ [Show All →] │
│ │
├────────────────────────────────────────────────────────────────┤
│ Traffic Distribution (Top 10 by RPS) │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ [Pie Chart] ││
│ │ ││
│ │ user-platform: 50.2% ││
│ │ payments: 21.9% ││
│ │ analytics: 10.6% ││
│ │ notifications: 7.6% ││
│ │ others: 9.7% ││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ Namespace Details: user-platform │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Configuration ││
│ │ ├─ Patterns: KeyValue (Redis), PubSub (NATS) ││
│ │ ├─ Created: 2025-10-15 ││
│ │ ├─ Owner: team-user-platform@example.com ││
│ │ └─ Max RPS: 10,000 (current: 42%) ││
│ │ ││
│ │ Performance (Last 1 hour) ││
│ │ ├─ P99 Latency: [Chart] ││
│ │ ├─ Throughput: [Chart] ││
│ │ └─ Error Rate: [Chart] ││
│ │ ││
│ │ Top Operations ││
│ │ ├─ KeyValue.Get: 3,201 RPS ││
│ │ ├─ KeyValue.Set: 987 RPS ││
│ │ └─ PubSub.Publish: 43 RPS ││
│ └────────────────────────────────────────────────────────────┘│
│ │
└────────────────────────────────────────────────────────────────┘
```
Key Features
- Namespace Table: RPS, latency, error rate per namespace
- Traffic Distribution: Identify noisy neighbors (>80% of traffic)
- Namespace Drill-Down: Detailed metrics for selected namespace
- Capacity Tracking: Show RPS vs max configured capacity
- Pattern Usage: Which patterns each namespace uses
Data Sources
- Admin API: Query namespace configurations
- SigNoz Traces: Filter by namespace tag for per-namespace metrics
- Proxy Metrics: Scrape `prism_requests_total{namespace="..."}` and group by the namespace label
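
A sketch of the per-namespace roll-up over scraped counter samples, reusing the illustrative `Sample` type from the collector sketch in View 1 (label parsing is simplified; quoted commas inside label values are not handled):

```go
// dashboard/aggregators/namespaces.go (sketch)
package aggregators

import "strings"

// RequestsByNamespace sums prism_requests_total counters by their namespace
// label. Computing RPS then takes the delta between two consecutive scrapes
// divided by the scrape interval.
func RequestsByNamespace(samples []Sample) map[string]float64 {
    totals := make(map[string]float64)
    for _, s := range samples {
        if s.Name != "prism_requests_total" {
            continue
        }
        ns := labelValue(s.Labels, "namespace")
        if ns != "" {
            totals[ns] += s.Value
        }
    }
    return totals
}

// labelValue pulls one label out of a raw label block like
// namespace="payments",operation="get".
func labelValue(labels, key string) string {
    for _, pair := range strings.Split(labels, ",") {
        kv := strings.SplitN(pair, "=", 2)
        if len(kv) == 2 && strings.TrimSpace(kv[0]) == key {
            return strings.Trim(kv[1], `"`)
        }
    }
    return ""
}
```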
View 6: Security Monitoring
Priority: MEDIUM | Detection Probability: 80% | Update Frequency: 10 seconds
Purpose: Authentication and authorization tracking
Layout
```text
┌────────────────────────────────────────────────────────────────┐
│ Security Monitoring │
├────────────────────────────────────────────────────────────────┤
│ │
│ Authentication Health │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ JWT Validation Success: 99.8% ✅ ││
│ │ Token Refreshes: 127 (last 1h) ││
│ │ Failed Attempts: 8 (last 1h) ││
│ │ Dex Connectivity: 🟢 CONNECTED ││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ Failed Authentication Attempts (Last 24 hours) │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Time Reason Source IP User ││
│ ├────────────────────────────────────────────────────────────┤│
│ │ 13:24:15 Token expired 10.0.1.45 alice@ ││
│ │ 13:18:42 Invalid signature 10.0.2.12 unknown ││
│ │ 12:45:33 Token expired 10.0.1.45 alice@ ││
│ │ 11:32:18 Missing token 10.0.3.88 unknown ││
│ │ 11:15:07 Invalid issuer 10.0.2.99 unknown ││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ Authorization (Last 1 hour) │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Total Requests: 508,432 ││
│ │ Authorized: 508,401 (99.99%) ││
│ │ Denied: 31 (0.01%) ││
│ │ ││
│ │ Denial Reasons: ││
│ │ ├─ Namespace access denied: 18 ││
│ │ ├─ Pattern not allowed: 8 ││
│ │ ├─ Rate limit exceeded: 5 ││
│ │ └─ Invalid operation: 0 ││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ JWKS Cache │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Cache Hits: 508,401 (99.99%) ││
│ │ Cache Misses: 31 (0.01%) ││
│ │ Last Refresh: 13:15:42 (15 minutes ago) ││
│ │ Next Refresh: 13:45:42 (in 15 minutes) ││
│ └────────────────────────────────────────────────────────────┘│
│ │
└────────────────────────────────────────────────────────────────┘
```
Key Features
- Auth Success Rate: Should be >99% (excludes expected token expirations)
- Failed Attempts Log: Investigate suspicious patterns (same IP, repeated failures)
- Authorization Denials: Track why requests are denied
- JWKS Cache Health: Ensure public key cache is working
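
A sketch of how the failed-attempts table could be backed on the dashboard side: a bounded, thread-safe in-memory log (the bound and field names are illustrative):

```go
// dashboard/aggregators/security.go (sketch)
package aggregators

import (
    "sync"
    "time"
)

type AuthFailure struct {
    Time     time.Time
    Reason   string // e.g. "Token expired", "Invalid signature"
    SourceIP string
    User     string
}

// AuthLog keeps the most recent failures for the dashboard table; older
// entries age out so memory stays bounded.
type AuthLog struct {
    mu       sync.Mutex
    failures []AuthFailure
    max      int
}

func NewAuthLog(max int) *AuthLog { return &AuthLog{max: max} }

func (l *AuthLog) Record(f AuthFailure) {
    l.mu.Lock()
    defer l.mu.Unlock()
    l.failures = append(l.failures, f)
    if len(l.failures) > l.max {
        l.failures = l.failures[len(l.failures)-l.max:] // keep newest N
    }
}

// Recent returns failures within the window (e.g. 24h for the table above).
func (l *AuthLog) Recent(window time.Duration) []AuthFailure {
    l.mu.Lock()
    defer l.mu.Unlock()
    cutoff := time.Now().Add(-window)
    var out []AuthFailure
    for _, f := range l.failures {
        if f.Time.After(cutoff) {
            out = append(out, f)
        }
    }
    return out
}
```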
View 7: Observability Health
Priority: MEDIUM-LOW | Detection Probability: 75% | Update Frequency: 30 seconds
Purpose: Verify OpenTelemetry pipeline is working
Layout
```text
┌────────────────────────────────────────────────────────────────┐
│ Observability Health │
├────────────────────────────────────────────────────────────────┤
│ │
│ OpenTelemetry Pipeline Status │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Component Status Last Exported ││
│ ├────────────────────────────────────────────────────────────┤│
│ │ Prism Proxy → SigNoz 🟢 HEALTHY 2 seconds ago ││
│ │ Patterns → SigNoz 🟢 HEALTHY 3 seconds ago ││
│ │ OTLP Collector 🟢 HEALTHY 1 second ago ││
│ │ SigNoz (Query Service) 🟢 HEALTHY 5 seconds ago ││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ Export Metrics (Last 5 minutes) │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Traces Exported: 12,487 ││
│ │ Metrics Exported: 52,301 ││
│ │ Logs Exported: 8,945 ││
│ │ ││
│ │ Export Errors: 3 (0.02%) ││
│ │ Export Latency: P99 12ms ││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ Trace Coverage │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Traces with All Spans: 12,401 (99.3%) ││
│ │ Traces Missing Spans: 86 (0.7%) ││
│ │ ││
│ │ Missing Spans Breakdown: ││
│ │ ├─ Backend span missing: 45 ││
│ │ ├─ Pattern span missing: 31 ││
│ │ └─ Proxy span missing: 10 ││
│ └────────────────────────────────────────────────────────────┘│
│ │
└────────────────────────────────────────────────────────────────┘
```
Key Features
- Pipeline Status: Verify all components exporting telemetry
- Export Metrics: Count of traces/metrics/logs exported
- Trace Coverage: Identify missing spans (should have proxy + pattern + backend)
View 8: System Resources
Priority: LOW | Detection Probability: 70% | Update Frequency: 10 seconds
Purpose: CPU, memory, and capacity tracking
Layout
```text
┌────────────────────────────────────────────────────────────────┐
│ System Resources │
├────────────────────────────────────────────────────────────────┤
│ │
│ Prism Proxy │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Memory: 287 MB / 500 MB (57%) [████████▌ ] ││
│ │ CPU: 12% (0.48 cores) ││
│ │ Threads: 24 ││
│ │ File Descriptors: 156 / 1024 (15%) ││
│ │ Uptime: 4d 3h 24m ││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ Pattern Processes │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Pattern Memory CPU Goroutines Uptime ││
│ ├────────────────────────────────────────────────────────────┤│
│ │ MemStore 12 MB 2% 8 4d 3h ││
│ │ Redis 45 MB 5% 12 4d 3h ││
│ │ NATS 38 MB 8% 15 2h 15m ││
│ │ PostgreSQL 67 MB 3% 20 4d 3h ││
│ └────────────────────────────────────────────────────────────┘│
│ │
├────────────────────────────────────────────────────────────────┤
│ System Totals │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Total Memory: 449 MB ││
│ │ Total CPU: 30% (1.2 cores) ││
│ │ Total Processes: 5 ││
│ └────────────────────────────────────────────────────────────┘│
│ │
└────────────────────────────────────────────────────────────────┘
```
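
The per-pattern rows above (memory, goroutines, uptime) can be self-reported from the Go runtime by each pattern process. A sketch of the metadata a pattern might attach to its health response; CPU needs OS-level sampling (e.g. /proc) and is omitted here:

```go
// patterns/shared/resources.go (sketch)
package shared

import (
    "fmt"
    "runtime"
    "time"
)

// ResourceMetadata reports process-level stats from the Go runtime, in the
// same string-map shape as the health-check metadata used elsewhere.
func ResourceMetadata(start time.Time) map[string]string {
    var m runtime.MemStats
    runtime.ReadMemStats(&m)
    return map[string]string{
        "memory_bytes": fmt.Sprintf("%d", m.Alloc), // live heap allocation
        "goroutines":   fmt.Sprintf("%d", runtime.NumGoroutine()),
        "uptime":       time.Since(start).Truncate(time.Second).String(),
    }
}
```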
Technical Implementation
See ADR-061: Framework-Less Web UI for complete implementation details and code examples.
This section provides a high-level overview. ADR-061 contains comprehensive Go code examples, HTMX patterns, and D3.js visualization code.
Backend Architecture
```text
dashboard/
├── main.go # Go HTTP server entry point
├── config.go # Configuration (env vars)
├── handlers/
│ ├── dashboard.go # View 1 (System Health)
│ ├── performance.go # View 2
│ ├── backends.go # View 3
│ ├── messaging.go # View 4
│ ├── namespaces.go # View 5
│ ├── security.go # View 6
│ ├── observability.go # View 7
│ ├── resources.go # View 8
│ ├── api.go # JSON API for HTMX
│ └── websocket.go # WebSocket hub
├── collectors/
│ ├── prometheus.go # Scrape proxy metrics
│ ├── grpc_health.go # Query pattern health checks
│ ├── signoz.go # Query SigNoz API
│ └── admin.go # Query Admin API
├── aggregators/
│ ├── system_health.go # Aggregate system health
│ ├── performance.go # Compute percentiles, SLO
│ └── messaging.go # Compute message flow metrics
├── templates/
│ ├── dashboard.html # Main dashboard template
│ ├── performance.html # Performance view template
│ ├── backends.html # Backend health template
│ └── partials/
│ ├── pattern_grid.html # Reusable pattern grid
│ └── metrics_chart.html # Reusable chart component
├── static/
│ ├── css/
│ │ └── dashboard.css # Custom styles
│ ├── js/
│ │ ├── htmx.min.js # HTMX (14KB)
│ │ ├── d3.v7.min.js # D3.js (70KB)
│ │ ├── mermaid.min.js # Mermaid.js (200KB)
│ │ └── dashboard.js # Custom WebSocket + D3 logic
│ └── assets/
│ └── prism-logo.svg # Static assets
└── embed.go # Embed templates/static with go:embed
```
Go Backend Example
See ADR-061 for complete implementation with all collectors and aggregators.
```go
// dashboard/main.go
package main

import (
    "embed"
    "html/template"
    "log"
    "net/http"
    "time"

    "github.com/gorilla/mux"
    "github.com/gorilla/websocket"
)

//go:embed templates/* static/*
var content embed.FS

var (
    templates *template.Template
    upgrader  = websocket.Upgrader{
        ReadBufferSize:  1024,
        WriteBufferSize: 1024,
    }
)

func init() {
    templates = template.Must(template.ParseFS(content, "templates/*.html", "templates/partials/*.html"))
}

func main() {
    r := mux.NewRouter()

    // Serve static files (embedded)
    r.PathPrefix("/static/").Handler(http.FileServer(http.FS(content)))

    // Page routes (render Go templates)
    r.HandleFunc("/", dashboardHandler).Methods("GET")
    r.HandleFunc("/performance", performanceHandler).Methods("GET")
    r.HandleFunc("/backends", backendsHandler).Methods("GET")

    // API routes (JSON/HTML fragments for HTMX)
    r.HandleFunc("/api/health", apiHealthHandler).Methods("GET")
    r.HandleFunc("/api/patterns", apiPatternsHandler).Methods("GET")
    r.HandleFunc("/api/performance", apiPerformanceHandler).Methods("GET")

    // WebSocket for real-time updates
    r.HandleFunc("/ws", wsHandler)

    // Start background data collector
    go startCollector()

    log.Println("Dashboard starting on :8095")
    log.Fatal(http.ListenAndServe(":8095", r))
}

func dashboardHandler(w http.ResponseWriter, r *http.Request) {
    data := collectSystemHealth()
    templates.ExecuteTemplate(w, "dashboard.html", data)
}

func apiHealthHandler(w http.ResponseWriter, r *http.Request) {
    // Return HTML fragment for HTMX partial replacement
    data := collectSystemHealth()
    templates.ExecuteTemplate(w, "pattern_grid.html", data)
}

func wsHandler(w http.ResponseWriter, r *http.Request) {
    conn, err := upgrader.Upgrade(w, r, nil)
    if err != nil {
        log.Println("WebSocket upgrade error:", err)
        return
    }
    defer conn.Close()

    ticker := time.NewTicker(2 * time.Second)
    defer ticker.Stop()

    for range ticker.C {
        data := collectSystemHealth()
        if err := conn.WriteJSON(data); err != nil {
            return
        }
    }
}
```
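
`startCollector` and `collectSystemHealth` are referenced above but not defined in this RFC. A minimal sketch, assuming the collector refreshes a shared snapshot on the 5-second cadence from the Data Flow section while handlers read the warm cache (`aggregateFromSources` is a hypothetical helper standing in for the Prometheus, gRPC health, and SigNoz collectors):

```go
// dashboard/collector.go (sketch, same package as main.go)

import "sync"

var (
    healthMu    sync.RWMutex
    healthCache SystemHealthData
)

// startCollector refreshes the shared snapshot every 5 seconds so that page,
// API, and WebSocket handlers never fan out to data sources per request.
func startCollector() {
    ticker := time.NewTicker(5 * time.Second)
    defer ticker.Stop()
    for range ticker.C {
        data := aggregateFromSources() // Prometheus scrape + gRPC health + SigNoz query
        healthMu.Lock()
        healthCache = data
        healthMu.Unlock()
    }
}

// collectSystemHealth returns the latest snapshot; handlers call this.
func collectSystemHealth() SystemHealthData {
    healthMu.RLock()
    defer healthMu.RUnlock()
    return healthCache
}
```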
Go Template Example (Server-Rendered HTML)
```html
<!-- dashboard/templates/dashboard.html -->
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Prism Operations Dashboard</title>
  <link rel="stylesheet" href="/static/css/dashboard.css">
  <script src="/static/js/htmx.min.js"></script>
  <script src="/static/js/d3.v7.min.js"></script>
</head>
<body>
  <header>
    <h1>Prism Operations Dashboard</h1>
    <div class="refresh-indicator" id="last-update">Last updated: {{.Timestamp.Format "15:04:05"}}</div>
  </header>

  <main>
    <!-- System Status Banner -->
    <section class="status-banner status-{{.OverallStatus}}">
      <div class="status-icon">{{if eq .OverallStatus "HEALTHY"}}🟢{{else if eq .OverallStatus "DEGRADED"}}🟡{{else}}🔴{{end}}</div>
      <div class="status-text">SYSTEM {{.OverallStatus}}</div>
      <div class="metrics-summary">
        <span class="metric">✅ {{printf "%.2f" .SuccessRate}}% Success Rate</span>
        <span class="metric">📊 {{.RPS}} RPS</span>
        <span class="metric">⚡ {{printf "%.1f" .LatencyP99}}ms P99</span>
      </div>
    </section>

    <!-- Pattern Health Grid (Auto-refresh with HTMX) -->
    <section class="pattern-health">
      <h2>Pattern Health</h2>
      <div hx-get="/api/patterns" hx-trigger="load, every 5s" hx-swap="innerHTML">
        {{template "pattern_grid.html" .}}
      </div>
    </section>

    <!-- Critical Metrics Charts (D3.js) -->
    <section class="metrics-charts">
      <h2>Critical Metrics (Last 5 minutes)</h2>
      <div class="chart-grid">
        <div id="latency-chart" class="chart"></div>
        <div id="throughput-chart" class="chart"></div>
        <div id="error-chart" class="chart"></div>
      </div>
    </section>

    <!-- Recent Alerts -->
    <section class="alerts">
      <h2>Recent Alerts</h2>
      <ul class="alert-list" hx-get="/api/alerts" hx-trigger="load, every 10s" hx-swap="innerHTML">
        {{range .RecentAlerts}}
        <li class="alert-{{.Severity}}">
          <span class="alert-time">{{.Timestamp.Format "15:04"}}</span>
          {{.Message}}
        </li>
        {{end}}
      </ul>
    </section>
  </main>

  <script src="/static/js/dashboard.js"></script>
  <script>
    // Render D3.js charts on load
    renderLatencyChart({{.LatencyHistory}});
    renderThroughputChart({{.ThroughputHistory}});
    renderErrorChart({{.ErrorHistory}});

    // WebSocket for real-time updates
    const ws = new WebSocket('ws://' + location.host + '/ws');
    ws.onmessage = (event) => {
      const data = JSON.parse(event.data);
      updateCharts(data);
      document.getElementById('last-update').textContent = 'Last updated: ' + new Date().toLocaleTimeString();
    };
  </script>
</body>
</html>
```
Data Flow Sequence
Deployment
Local Development
```bash
# Start dashboard (single command, instant reload with air)
cd dashboard
go run main.go

# OR with live reload
air

# Dashboard available at http://localhost:8095
# No build step required!
```
Production Binary
```bash
# Build single binary with embedded assets
cd dashboard
go build -o ../bin/prism-dashboard main.go

# Binary size: ~15MB (includes templates + static assets)
# Run anywhere (no dependencies)
./bin/prism-dashboard
```
Docker Compose (Optional)
```yaml
# docker-compose.dashboard.yml
version: '3.8'

services:
  dashboard:
    build:
      context: ./dashboard
      dockerfile: Dockerfile
    container_name: prism-dashboard
    ports:
      - "8095:8095"
    environment:
      - PROXY_METRICS_URL=http://prism-proxy:8980/metrics
      - SIGNOZ_API_URL=http://signoz-query:8080
      - ADMIN_API_URL=http://prism-admin:8090
    networks:
      - prism

networks:
  prism:
    external: true
```
```dockerfile
# dashboard/Dockerfile
FROM golang:1.21-alpine AS builder
WORKDIR /build
COPY . .
RUN go build -o prism-dashboard main.go

FROM alpine:latest
COPY --from=builder /build/prism-dashboard /usr/local/bin/
EXPOSE 8095
CMD ["prism-dashboard"]
```
Makefile Targets
```makefile
# Makefile
.PHONY: dashboard-run dashboard-build dashboard-dev dashboard-up dashboard-down dashboard-assets

dashboard-run:
	@echo "Starting Prism Dashboard (Go)..."
	cd dashboard && go run main.go

dashboard-build:
	@echo "Building Dashboard binary..."
	cd dashboard && go build -o ../bin/prism-dashboard main.go
	@echo "Binary: bin/prism-dashboard (size: $(shell du -h bin/prism-dashboard | cut -f1))"

dashboard-dev:
	@echo "Starting Dashboard with live reload..."
	cd dashboard && air

# Install air: go install github.com/cosmtrek/air@latest

dashboard-up:
	@echo "Starting Prism Operations Dashboard (Docker)..."
	docker-compose -f docker-compose.dashboard.yml up -d
	@echo "Dashboard: http://localhost:8095"

dashboard-down:
	docker-compose -f docker-compose.dashboard.yml down

dashboard-assets:
	@echo "Downloading frontend assets..."
	mkdir -p dashboard/static/js
	curl -o dashboard/static/js/htmx.min.js https://unpkg.com/htmx.org@1.9.10/dist/htmx.min.js
	curl -o dashboard/static/js/d3.v7.min.js https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js
	curl -o dashboard/static/js/mermaid.min.js https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.min.js
```
Alternatives Considered
Alternative 1: Grafana Dashboards
Pros:
- Industry standard, mature
- Excellent charting library
- Integrates with Prometheus/SigNoz
- Alerting built-in
Cons:
- Generic (not tailored to Prism's specific patterns)
- Requires learning Grafana query language
- Less real-time feel (polling-based)
- No custom interaction (e.g., click pattern to drill down)
Rejected because: Custom dashboard provides better UX for Prism-specific workflows
Alternative 2: SigNoz-Only Approach
Pros:
- Already using SigNoz
- No additional service to maintain
- Built-in trace/metric visualization
Cons:
- Not optimized for operational health monitoring
- No pattern-specific views
- Can't aggregate across multiple data sources (SigNoz + Admin API + pattern health checks)
- No custom alerts/thresholds
Rejected because: SigNoz is for debugging/analysis, not operational monitoring
Alternative 3: Prometheus + AlertManager Only
Pros:
- Simple, battle-tested
- Low resource footprint
- Alert-focused
Cons:
- No visual dashboard (alerts via notifications only)
- Requires configuring complex PromQL queries
- No real-time drill-down
- Limited context (just metrics, no traces)
Rejected because: Alerts are reactive; dashboard is proactive
Alternative 4: Framework-Based (FastAPI + React)
Pros:
- Rich React component ecosystem
- FastAPI modern Python framework
- TypeScript type safety
- Popular stack, easy to hire for
Cons:
- ❌ Build complexity: npm install (5min), webpack build (10-60s)
- ❌ Dependency hell: 500+ npm packages
- ❌ Slow iteration: Change → rebuild → reload (10s+)
- ❌ Large bundles: 2-5MB JavaScript
- ❌ Framework churn: React/Vue/Svelte versions change
- ❌ Language mismatch: Python + JavaScript = context switching
- ❌ Debugging complexity: Source maps, transpilation issues
Rejected because: Framework-less approach (Go + HTMX + D3.js) provides:
- Instant reload (no build step)
- 300KB bundle vs 2-5MB
- Single language (Go for backend + templates)
- Simpler deployment (single binary vs Python + Node.js)
- Aligned with project philosophy (ADR-061)
See ADR-061: Framework-Less Web UI for complete rationale.
Success Metrics
Dashboard Effectiveness
| Metric | Target | Measurement |
|---|---|---|
| MTTR Reduction | 50% faster | Time to identify root cause |
| Issue Detection Before User Reports | 95% | Issues caught by dashboard vs user tickets |
| Developer Adoption | 80% daily usage | Unique dashboard users per day |
| Alert Noise Reduction | 30% fewer false alerts | Alert count before/after dashboard |
| Dashboard Performance | <2s end-to-end latency | Time from metric change to UI update |
Technical Metrics
| Metric | Target |
|---|---|
| WebSocket Uptime | >99.9% |
| API Response Time | P99 <100ms |
| Frontend Load Time | <2s initial load |
| Memory Footprint | <256MB backend, <50MB frontend |
| Data Freshness | <5s lag from source |
Implementation Phases
Phase 1: MVP (Week 1-2)
- View 1: System Health (status banner, pattern grid)
- Backend: Go HTTP server with WebSocket
- Frontend: Go templates + HTMX + D3.js
- Data collectors: Prometheus scraper, gRPC health client
- Deliverable: Working dashboard at `localhost:8095`, instant reload
Phase 2: Performance & Backend Views (Week 3)
- View 2: Performance Monitoring (latency, throughput, SLO)
- View 3: Backend Health (connection pools)
- Integrate SigNoz API queries
- D3.js charts for latency/throughput visualization
Phase 3: Messaging & Multi-Tenancy (Week 4)
- View 4: Messaging Flow (PubSub metrics)
- View 5: Multi-Tenancy (namespace breakdown)
- HTMX partial updates for pattern grid
- Mermaid.js diagrams for message flow
Phase 4: Security & Observability (Week 5)
- View 6: Security Monitoring
- View 7: Observability Health
- View 8: System Resources
- Advanced D3.js visualizations (heatmaps, pie charts)
Phase 5: Polish & Production (Week 6)
- Alert history and notification integration
- Dark mode support (CSS variables)
- Mobile-responsive layout (Tailwind optional)
- Single binary build with embedded assets
- Documentation and Makefile targets
Open Questions
1. Should alerts be embedded in dashboard or use separate AlertManager?
   - Proposal: Dashboard shows alerts but doesn't manage them (SigNoz AlertManager owns alert logic)
   - Reasoning: Separation of concerns; dashboard is read-only view

2. How to handle multi-proxy deployments (future)?
   - Proposal: Add proxy selector dropdown in UI, query metrics per proxy
   - Reasoning: Defer until multi-proxy is required (post-POC)

3. Should dashboard persist historical data or rely on SigNoz?
   - Proposal: No persistence; query SigNoz for historical views
   - Reasoning: Avoid data duplication, SigNoz is source of truth

4. Access control for dashboard?
   - Proposal: Phase 1 = no auth (local dev), Phase 2 = integrate with Dex OIDC
   - Reasoning: Focus on functionality first, add auth for production

5. Mobile app or web-only?
   - Proposal: Web-only, responsive design for mobile browsers
   - Reasoning: Native mobile app is significant scope expansion
Related Documents
- RFC-016: Local Development Infrastructure - SigNoz/Dex setup
- RFC-018: POC Implementation Strategy - Implementation context
- ADR-028: Admin UI (FastAPI + Ember) - Admin UI architecture (different from HUD)
- ADR-048: Local SigNoz Observability - SigNoz integration
- RFC-008: Proxy Plugin Architecture - Metrics endpoints
References
Dashboard Design Patterns
- Google SRE: Monitoring Distributed Systems
- DataDog Dashboard Best Practices
- Grafana Dashboard Design
Revision History
- 2025-11-07: Initial draft with 8 prioritized views and technical architecture