# ADR-048: Local Signoz Instance for Observability Testing

## Status

Accepted - 2025-10-09

## Context

### Problem Statement

Developers need comprehensive observability during local development and testing:

- **Traces**: See request flows through proxy → plugin → backend
- **Metrics**: Monitor latency, throughput, error rates
- **Logs**: Correlate structured logs with traces
- **OpenTelemetry**: Native OTLP support for Prism instrumentation

**Current Situation**:

- No local observability stack
- Developers rely on console logs and printf debugging
- Hard to debug distributed systems issues (proxy + plugins + backends)
- No way to validate OpenTelemetry instrumentation before production

**Requirements**:

1. Full OpenTelemetry stack (traces, metrics, logs)
2. Runs independently from other local infrastructure
3. Minimal resource footprint (< 2GB RAM)
4. Fast startup (< 30 seconds)
5. Pre-configured for Prism components
6. Persist data across restarts

### Why Signoz?

**Evaluated Options**:

- Jaeger (traces only)
- Prometheus + Grafana (metrics + visualization, complex setup)
- Elastic Stack (heavy, 4GB+ RAM)
- Signoz (all-in-one OpenTelemetry platform)

**Signoz Advantages**:

- ✅ **Native OpenTelemetry**: OTLP gRPC/HTTP receivers built-in
- ✅ **All-in-one**: Traces, metrics, logs in a single UI
- ✅ **Lightweight**: ~1.5GB RAM with ClickHouse backend
- ✅ **Fast queries**: ClickHouse columnar storage
- ✅ **APM features**: Service maps, dependency graphs, alerts
- ✅ **Open source**: Apache 2.0 license
- ✅ **Docker Compose**: Pre-built stack available

**Comparison**:

| Feature | Signoz | Jaeger | Prometheus + Grafana | Elastic Stack |
|---------|--------|--------|----------------------|---------------|
| Traces | ✅ | ✅ | ❌ | ✅ |
| Metrics | ✅ | ❌ | ✅ | ✅ |
| Logs | ✅ | ❌ | ❌ (via Loki) | ✅ |
| OpenTelemetry Native | ✅ | ✅ | ⚠️ (via collector) | ⚠️ (via collector) |
| Resource Usage | ~1.5GB | ~500MB | ~2GB | ~4GB |
| Setup Complexity | Low | Low | Medium | High |
| Query Performance | Fast (ClickHouse) | Medium | Fast (Prometheus) | Medium |
| APM Features | ✅ | ⚠️ Basic | ⚠️ Manual | ✅ |

**Signoz Architecture**:

```text
Signoz Components:
├── OTLP Receiver (gRPC :4317, HTTP :4318)
├── Query Service (API + UI :3301)
├── ClickHouse (storage :9000)
└── AlertManager (optional)
```


## Decision

**We will provide a local Signoz instance as part of the development support tooling.**

### Key Decisions

1. **Signoz as Standard Observability Platform**
   - Use Signoz for local development observability
   - Pre-configured for Prism proxy, plugins, and backends
   - Standardize on OpenTelemetry for all instrumentation

2. **Independent Docker Compose Stack**
   - Separate `docker-compose.signoz.yml` file
   - Not bundled with backend testing compose files
   - Can run independently or alongside other stacks
   - Uses dedicated Docker network with bridge to Prism

3. **Pre-configured Services**
   - Prism proxy: Auto-configured OTLP endpoint
   - Backend plugins: Environment variables for OTLP
   - Admin service: Traces and metrics enabled
   - Example applications: Sample instrumentation

4. **Data Persistence**
   - Volume mounts for ClickHouse data
   - Survives container restarts
   - Optional: Reset script to clear all data

5. **Resource Management**
   - Memory limit: 2GB total (1.5GB ClickHouse + 500MB services)
   - CPU limit: 2 cores
   - Port conflicts avoided (custom port range)

## Implementation Details

### Docker Compose Configuration

Location: `local-dev/signoz/docker-compose.signoz.yml`

```yaml
version: '3.8'

services:
  # ClickHouse database for storing telemetry data
  clickhouse:
    image: clickhouse/clickhouse-server:23.7-alpine
    container_name: prism-signoz-clickhouse
    volumes:
      - signoz-clickhouse-data:/var/lib/clickhouse
      - ./clickhouse-config.xml:/etc/clickhouse-server/config.d/logging.xml:ro
    environment:
      - CLICKHOUSE_DB=signoz
    ports:
      - "9001:9000"   # Avoid conflict with MinIO (9000)
      - "8124:8123"   # HTTP interface
    mem_limit: 1.5g
    cpus: 1.5
    networks:
      - signoz
    healthcheck:
      test: ["CMD", "wget", "-q", "-O-", "http://localhost:8123/ping"]
      interval: 10s
      timeout: 5s
      retries: 3

  # Signoz Query Service (API + UI)
  query-service:
    image: signoz/query-service:0.39.0
    container_name: prism-signoz-query
    command: ["-config=/root/config/prometheus.yml"]
    volumes:
      - ./signoz-config.yaml:/root/config/prometheus.yml
    environment:
      - ClickHouseUrl=tcp://clickhouse:9000
      - STORAGE=clickhouse
    ports:
      - "3301:8080"   # Query Service API + UI
    depends_on:
      clickhouse:
        condition: service_healthy
    mem_limit: 256m
    cpus: 0.5
    networks:
      - signoz
    healthcheck:
      test: ["CMD", "wget", "-q", "-O-", "http://localhost:8080/api/v1/health"]
      interval: 10s
      timeout: 5s
      retries: 3

  # OpenTelemetry Collector (OTLP receiver)
  otel-collector:
    image: signoz/signoz-otel-collector:0.79.9
    container_name: prism-signoz-otel
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    environment:
      - CLICKHOUSE_URL=clickhouse:9000
    ports:
      - "4317:4317"   # OTLP gRPC receiver
      - "4318:4318"   # OTLP HTTP receiver
    depends_on:
      clickhouse:
        condition: service_healthy
    mem_limit: 256m
    cpus: 0.5
    networks:
      - signoz
      - prism   # Bridge to Prism components
    healthcheck:
      test: ["CMD", "wget", "-q", "-O-", "http://localhost:13133/"]
      interval: 10s
      timeout: 5s
      retries: 3

  # AlertManager (optional, for local alert testing)
  alertmanager:
    image: signoz/alertmanager:0.23.4
    container_name: prism-signoz-alertmanager
    volumes:
      - ./alertmanager-config.yaml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"
    mem_limit: 128m
    cpus: 0.25
    networks:
      - signoz

volumes:
  signoz-clickhouse-data:
    driver: local

networks:
  signoz:
    driver: bridge
  prism:
    external: true   # Connect to Prism components
```


### OpenTelemetry Collector Configuration

Location: `local-dev/signoz/otel-collector-config.yaml`

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
        cors:
          allowed_origins:
            - "http://localhost:*"
            - "http://127.0.0.1:*"

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024

  # Add resource attributes for Prism components
  resource:
    attributes:
      - key: deployment.environment
        value: "local"
        action: upsert
      - key: service.namespace
        value: "prism"
        action: upsert

  # Memory limiter to prevent OOM
  memory_limiter:
    check_interval: 5s
    limit_mib: 256
    spike_limit_mib: 64

exporters:
  clickhouse:
    endpoint: tcp://clickhouse:9000?database=signoz
    ttl: 72h   # Keep data for 3 days in local dev

  # Debug exporter for troubleshooting
  logging:
    loglevel: info
    sampling_initial: 5
    sampling_thereafter: 200

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [clickhouse]

    metrics:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [clickhouse]

    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [clickhouse]
```

### Prism Proxy Integration

The proxy automatically detects and uses Signoz when available:

```rust
// proxy/src/observability/tracer.rs

use std::env;

use opentelemetry::sdk::{trace, Resource};
use opentelemetry::KeyValue;
use opentelemetry_otlp::WithExportConfig; // brings .with_endpoint() into scope

pub fn init_tracer(
    config: &ObservabilityConfig,
) -> Result<trace::Tracer, opentelemetry::trace::TraceError> {
    // Check for Signoz OTLP endpoint; fall back to the local default
    let otlp_endpoint = env::var("OTEL_EXPORTER_OTLP_ENDPOINT")
        .unwrap_or_else(|_| "http://localhost:4317".to_string());

    let tracer = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(
            opentelemetry_otlp::new_exporter()
                .tonic()
                .with_endpoint(otlp_endpoint),
        )
        .with_trace_config(
            trace::config().with_resource(Resource::new(vec![
                KeyValue::new("service.name", "prism-proxy"),
                KeyValue::new("service.version", env!("CARGO_PKG_VERSION")),
                KeyValue::new("deployment.environment", "local"),
            ])),
        )
        .install_batch(opentelemetry::runtime::Tokio)?;

    Ok(tracer)
}
```


### Plugin Integration

Plugins receive OTLP configuration via environment variables:

```go
// plugins/core/observability/tracer.go

package observability

import (
	"context"
	"os"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
)

func InitTracer(serviceName string) error {
	// Reads OTEL_EXPORTER_OTLP_ENDPOINT from the environment
	exporter, err := otlptracegrpc.New(
		context.Background(),
		otlptracegrpc.WithInsecure(),
		otlptracegrpc.WithEndpoint(os.Getenv("OTEL_EXPORTER_OTLP_ENDPOINT")),
	)
	if err != nil {
		return err
	}

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceName(serviceName),
			semconv.DeploymentEnvironment("local"),
		)),
	)

	otel.SetTracerProvider(tp)
	return nil
}
```
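For reference, a minimal sketch of how a plugin process might use this helper at startup. The `main` function, service name, and span name here are illustrative assumptions, not actual Prism plugin code:

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	// In practice InitTracer is imported from plugins/core/observability.
)

func main() {
	if err := InitTracer("prism-plugin-postgres"); err != nil {
		log.Fatalf("tracing init: %v", err)
	}

	tracer := otel.Tracer("prism-plugin-postgres")

	// Spans started from a ctx that carries the proxy's propagated trace
	// context will nest under the proxy span in the Signoz waterfall.
	ctx, span := tracer.Start(context.Background(), "postgres.query")
	defer span.End()
	_ = ctx // ... execute the backend call with ctx ...
}
```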


## Usage

### Starting Signoz

```bash
# Start Signoz stack
cd local-dev/signoz
docker-compose -f docker-compose.signoz.yml up -d

# Wait for healthy state
docker-compose -f docker-compose.signoz.yml ps

# Access UI
open http://localhost:3301
```
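To confirm telemetry flows end to end before wiring up Prism, a throwaway Go program (hypothetical, assuming only the standard OpenTelemetry Go SDK) can send a single span to the collector. It should appear in the Traces tab within a few seconds (under `unknown_service` unless a resource is attached):

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// Point the exporter at the collector from docker-compose.signoz.yml.
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithInsecure(),
		otlptracegrpc.WithEndpoint("localhost:4317"),
	)
	if err != nil {
		log.Fatalf("create exporter: %v", err)
	}

	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))

	// Emit one span, then shut down to flush the batch before exiting.
	_, span := tp.Tracer("smoke-test").Start(ctx, "signoz-smoke-test")
	span.End()

	if err := tp.Shutdown(ctx); err != nil {
		log.Fatalf("flush spans: %v", err)
	}
}
```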


### Starting Prism with Signoz

```bash
# Set OTLP endpoint environment variable
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317

# Start Prism proxy
cd proxy
cargo run --release

# Start plugin with OTLP
cd plugins/postgres
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317 go run ./cmd/server
```


### Viewing Traces

1. Open Signoz UI: http://localhost:3301
2. Navigate to "Traces" tab
3. Filter by service: `prism-proxy`, `prism-plugin-postgres`
4. Click trace to see full waterfall
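Filtering works best when spans carry searchable attributes. A hedged sketch of starting a tagged span (the `prism.*` attribute keys are hypothetical, not an established Prism convention):

```go
package observability

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// TagSpan starts a span with attributes that the Signoz Traces tab can
// filter and group by; the key names here are illustrative only.
func TagSpan(ctx context.Context) (context.Context, trace.Span) {
	tracer := otel.Tracer("prism-plugin-postgres")
	return tracer.Start(ctx, "keyvalue.get",
		trace.WithAttributes(
			attribute.String("prism.namespace", "orders"),
			attribute.String("prism.backend", "postgres"),
		),
	)
}
```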

### Resetting Data

```bash
# Stop and remove volumes
docker-compose -f docker-compose.signoz.yml down -v

# Restart fresh
docker-compose -f docker-compose.signoz.yml up -d
```


## Consequences

### Positive

1. **Comprehensive Observability**
   - Full visibility into distributed request flows
   - Correlate traces, metrics, and logs
   - Service dependency mapping

2. **Developer Productivity**
   - Faster debugging with trace waterfalls
   - Identify performance bottlenecks
   - Validate instrumentation locally

3. **Production Parity**
   - Same OpenTelemetry instrumentation locally and in production
   - Catch telemetry issues before deployment
   - Test alerting rules locally

4. **Low Friction**
   - Single command to start (`docker-compose up`)
   - Auto-discovery by Prism components
   - Pre-configured, zero setup

5. **Resource Efficient**
   - ~1.5GB RAM total
   - Fast queries with ClickHouse
   - Optional: disable when not needed

### Negative

1. **Additional Service to Manage**
   - One more Docker Compose stack
   - Requires keeping Signoz updated
   - Potential port conflicts (mitigated by custom ports)

2. **Learning Curve**
   - Developers need to understand the Signoz UI
   - Query language for advanced filtering
   - Trace correlation concepts

3. **Data Cleanup**
   - ClickHouse data accumulates over time
   - Manual cleanup required (or automated scripts)

4. **Not a Full Production Solution**
   - Local instance is for development only
   - Production needs a scalable Signoz deployment
   - Different configuration for production

### Mitigation Strategies

1. **Documentation**: Comprehensive RFC-016 for setup and usage
2. **Scripts**: Automated start/stop/reset scripts
3. **Resource Limits**: Docker memory/CPU limits to prevent resource exhaustion
4. **Default Off**: Developers opt-in when needed (not running by default)

## Related Decisions

- [ADR-047: OpenTelemetry Tracing Integration](/adr/adr-047)
- [ADR-046: Dex IDP for Local Identity Testing](/adr/adr-046)
- [RFC-016: Local Development Infrastructure](/rfc/rfc-016) (this ADR's implementation guide)
- [RFC-008: Proxy Plugin Architecture](/rfc/rfc-008) (instrumentation points)

## References

- [Signoz Documentation](https://signoz.io/docs/)
- [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/)
- [ClickHouse](https://clickhouse.com/)
- [OTLP Protocol Specification](https://opentelemetry.io/docs/specs/otlp/)

## Revision History

- 2025-10-09: Initial decision for local Signoz instance