MEMO-011: Distributed Error Handling Best Practices
Purpose
Document comprehensive error handling best practices for distributed systems and explain the design of Prism's enhanced error proto (prism.common.Error
).
Problem Statement
Basic error messages like Error { code: 500, message: "Internal error" }
are insufficient for distributed systems because:
- Lack of Context: No information about where/when/why the error occurred
- Not Actionable: Clients don't know if they should retry or give up
- Poor Observability: Can't categorize, aggregate, or alert on errors effectively
- Debugging Difficulty: Missing correlation IDs, trace context, and cause chains
- Backend Opacity: Can't distinguish Redis errors from Kafka errors
- No Remediation: No hints on how to fix the problem
Solution: Rich Structured Errors
Prism's Error
message captures distributed systems best practices from:
- Google's API error model (
google.rpc.Status
,ErrorInfo
,RetryInfo
) - Stripe's error design (detailed, actionable)
- AWS error patterns (retryable classification, throttling guidance)
- gRPC error handling (status codes + rich details)
Core Design Principles
1. Simple by Default, Rich When Needed
Basic error (minimal fields):
Error {
code: ERROR_CODE_NOT_FOUND
message: "Key 'user:12345' not found"
request_id: "req-abc123"
}
Rich error (with full context):
Error {
code: ERROR_CODE_BACKEND_ERROR
message: "Redis connection timeout"
request_id: "req-abc123"
category: ERROR_CATEGORY_BACKEND_ERROR
severity: ERROR_SEVERITY_ERROR
timestamp: { seconds: 1696867200 }
source: "prism-proxy-pod-3"
namespace: "user-profiles"
retry_policy: {
retryable: true
retry_after: { seconds: 5 }
max_retries: 3
backoff_strategy: BACKOFF_STRATEGY_EXPONENTIAL
backoff_multiplier: 2.0
}
details: [{
backend_error: {
backend_type: "redis"
backend_instance: "redis-master-1"
backend_error_code: "ETIMEDOUT"
backend_error_message: "Connection timeout after 5000ms"
operation: "GET"
pool_state: {
active_connections: 50
idle_connections: 0
max_connections: 50
wait_count: 12
wait_duration: { seconds: 3 }
}
}
}]
help_links: [{
title: "Troubleshooting Redis Timeouts"
url: "https://docs.prism.io/troubleshooting/redis-timeout"
link_type: "troubleshooting"
}]
}
2. Machine-Readable and Human-Readable
Machine-readable (for clients to handle programmatically):
code
- HTTP-style error codescategory
- Classification for metrics/alertingseverity
- Impact levelretry_policy
- Actionable retry guidance
Human-readable (for developers debugging):
message
- Clear English descriptionhelp_links
- Links to documentation/runbooksretry_policy.retry_advice
- Human-readable retry guidance
3. Error Chaining for Distributed Context
Errors can have causes (errors that led to this error):
Error {
code: ERROR_CODE_GATEWAY_TIMEOUT
message: "Pattern execution timed out"
source: "prism-proxy"
causes: [{
code: ERROR_CODE_BACKEND_ERROR
message: "Redis SET operation timed out"
source: "redis-plugin"
causes: [{
code: ERROR_CODE_NETWORK_ERROR
message: "Connection refused"
source: "redis-client"
}]
}]
}
This enables root cause analysis across service boundaries.
4. Retry Guidance
Clients shouldn't guess whether to retry. The error tells them:
RetryPolicy {
retryable: true // Yes, retry
retry_after: { seconds: 10 } // Wait 10 seconds
max_retries: 3 // Try up to 3 times
backoff_strategy: EXPONENTIAL // Use exponential backoff
backoff_multiplier: 2.0 // Double delay each time
retry_advice: "Backend is temporarily overloaded. Retry with exponential backoff."
}
Non-retryable errors:
RetryPolicy {
retryable: false
backoff_strategy: NEVER
retry_advice: "Key validation failed. Fix the key format and retry."
}
5. Structured Error Details
Use ErrorDetail
oneof for type-safe error information:
FieldViolation (validation errors)
field_violation: {
field: "ttl_seconds"
description: "TTL must be positive"
invalid_value: "-100"
constraint: "ttl_seconds > 0"
}
BackendError (backend-specific context)
backend_error: {
backend_type: "kafka"
backend_instance: "kafka-broker-2"
backend_error_code: "OFFSET_OUT_OF_RANGE"
backend_error_message: "Offset 12345 is out of range [0, 10000]"
operation: "CONSUME"
}
PatternError (pattern-level semantics)
pattern_error: {
pattern_type: "keyvalue"
interface_name: "KeyValueTTLInterface"
semantic_error: "TTL not supported by PostgreSQL backend"
supported_operations: ["Set", "Get", "Delete", "Exists"]
}
QuotaViolation (rate limiting)
quota_violation: {
dimension: "requests_per_second"
current: 1500
limit: 1000
reset_time: { seconds: 1696867260 } // 60 seconds from now
}
PreconditionFailure (CAS, version conflicts)
precondition_failure: {
type: "ETAG_MISMATCH"
field: "etag"
expected: "abc123"
actual: "def456"
description: "Key was modified by another client"
}
6. Error Categorization for Observability
ErrorCategory enables metrics aggregation:
CLIENT_ERROR
- User made a mistake (400-level)SERVER_ERROR
- Internal service failure (500-level)BACKEND_ERROR
- Backend storage issueNETWORK_ERROR
- Connectivity problemTIMEOUT_ERROR
- Operation took too longRATE_LIMIT_ERROR
- Quota exceededAUTHORIZATION_ERROR
- Permission deniedVALIDATION_ERROR
- Input validation failedRESOURCE_ERROR
- Resource not found/unavailableCONCURRENCY_ERROR
- Concurrent modification conflict
Prometheus metrics:
prism_errors_total{category="backend_error", backend="redis", code="503"} 42
prism_errors_total{category="timeout_error", pattern="keyvalue", code="504"} 12
prism_errors_total{category="rate_limit_error", namespace="prod", code="429"} 156
ErrorSeverity for prioritization:
DEBUG
- Informational, no action neededINFO
- Notable but expected (e.g., cache miss)WARNING
- Degraded but functionalERROR
- Operation failed, action may be neededCRITICAL
- Severe failure, immediate action required
7. Traceability and Correlation
Request ID: Correlate errors across services
request_id: "req-abc123-def456"
Source: Which service generated the error
source: "prism-proxy-pod-3"
Timestamp: When the error occurred
timestamp: { seconds: 1696867200 }
Debug Info (development only):
debug_info: {
trace_id: "abc123def456"
span_id: "span-789"
stack_entries: [
"at handleRequest (proxy.rs:123)",
"at executePattern (pattern.rs:456)",
"at redisGet (redis.rs:789)"
]
}
8. Batch Error Handling
For batch operations, use ErrorResponse
:
ErrorResponse {
error: {
code: ERROR_CODE_UNPROCESSABLE_ENTITY
message: "Batch operation partially failed"
}
partial_success: true
success_count: 8
failure_count: 2
item_errors: [
{
index: 3
item_id: "user:12345"
error: {
code: ERROR_CODE_NOT_FOUND
message: "Key not found"
}
},
{
index: 7
item_id: "user:67890"
error: {
code: ERROR_CODE_PRECONDITION_FAILED
message: "Version conflict"
}
}
]
}
Error Code Design
HTTP-Style Codes (Broad Compatibility)
2xx Success:
200 OK
- Success (shouldn't appear in errors)
4xx Client Errors (caller should fix):
400 BAD_REQUEST
- Invalid request syntax/parameters401 UNAUTHORIZED
- Authentication required403 FORBIDDEN
- Authenticated but not authorized404 NOT_FOUND
- Resource doesn't exist405 METHOD_NOT_ALLOWED
- Operation not supported409 CONFLICT
- Resource state conflict410 GONE
- Resource permanently deleted412 PRECONDITION_FAILED
- Precondition not met (CAS)413 PAYLOAD_TOO_LARGE
- Request exceeds size limits422 UNPROCESSABLE_ENTITY
- Validation failed429 TOO_MANY_REQUESTS
- Rate limit exceeded
5xx Server Errors (caller should retry):
500 INTERNAL_ERROR
- Unexpected internal error501 NOT_IMPLEMENTED
- Feature not implemented502 BAD_GATEWAY
- Upstream backend error503 SERVICE_UNAVAILABLE
- Temporarily unavailable504 GATEWAY_TIMEOUT
- Upstream timeout507 INSUFFICIENT_STORAGE
- Backend storage full
6xx Prism-Specific (custom errors):
600 BACKEND_ERROR
- Backend-specific error601 PATTERN_ERROR
- Pattern-level semantic error602 INTERFACE_NOT_SUPPORTED
- Backend doesn't implement interface603 SLOT_ERROR
- Pattern slot configuration error604 CIRCUIT_BREAKER_OPEN
- Circuit breaker preventing requests
Mapping to gRPC Status Codes
HTTP Code | gRPC Status | Description |
---|---|---|
400 | INVALID_ARGUMENT | Bad request |
401 | UNAUTHENTICATED | Missing auth |
403 | PERMISSION_DENIED | No permission |
404 | NOT_FOUND | Resource missing |
409 | ALREADY_EXISTS | Conflict |
412 | FAILED_PRECONDITION | Precondition failed |
429 | RESOURCE_EXHAUSTED | Rate limited |
500 | INTERNAL | Internal error |
501 | UNIMPLEMENTED | Not implemented |
503 | UNAVAILABLE | Service down |
504 | DEADLINE_EXCEEDED | Timeout |
Usage Examples
Example 1: Validation Error
return &Error{
Code: ErrorCode_ERROR_CODE_UNPROCESSABLE_ENTITY,
Message: "Invalid key format",
RequestId: requestID,
Category: ErrorCategory_ERROR_CATEGORY_VALIDATION_ERROR,
Severity: ErrorSeverity_ERROR_SEVERITY_ERROR,
Source: "prism-proxy",
Namespace: req.Namespace,
RetryPolicy: &RetryPolicy{
Retryable: false,
BackoffStrategy: BackoffStrategy_BACKOFF_STRATEGY_NEVER,
RetryAdvice: "Fix the key format and retry. Keys must match pattern: [a-zA-Z0-9:_-]+",
},
Details: []*ErrorDetail{{
Detail: &ErrorDetail_FieldViolation{
FieldViolation: &FieldViolation{
Field: "key",
Description: "Key contains invalid characters",
InvalidValue: "user@#$%",
Constraint: "key must match regex: [a-zA-Z0-9:_-]+",
},
},
}},
HelpLinks: []*ErrorLink{{
Title: "Key Naming Conventions",
Url: "https://docs.prism.io/keyvalue/naming",
LinkType: "documentation",
}},
}
Example 2: Backend Connection Pool Exhaustion
return &Error{
Code: ErrorCode_ERROR_CODE_SERVICE_UNAVAILABLE,
Message: "Redis connection pool exhausted",
RequestId: requestID,
Category: ErrorCategory_ERROR_CATEGORY_BACKEND_ERROR,
Severity: ErrorSeverity_ERROR_SEVERITY_CRITICAL,
Timestamp: timestamppb.Now(),
Source: "redis-plugin",
Namespace: req.Namespace,
RetryPolicy: &RetryPolicy{
Retryable: true,
RetryAfter: durationpb.New(10 * time.Second),
MaxRetries: 5,
BackoffStrategy: BackoffStrategy_BACKOFF_STRATEGY_EXPONENTIAL,
BackoffMultiplier: 2.0,
RetryAdvice: "Connection pool is full. Retry with exponential backoff.",
},
Details: []*ErrorDetail{{
Detail: &ErrorDetail_BackendError{
BackendError: &BackendError{
BackendType: "redis",
BackendInstance: "redis-master-1",
BackendErrorCode: "POOL_EXHAUSTED",
BackendErrorMessage: "All 50 connections in use",
Operation: "GET",
PoolState: &ConnectionPoolState{
ActiveConnections: 50,
IdleConnections: 0,
MaxConnections: 50,
WaitCount: 28,
WaitDuration: durationpb.New(5 * time.Second),
},
},
},
}},
HelpLinks: []*ErrorLink{{
Title: "Scaling Redis Connection Pools",
Url: "https://docs.prism.io/backends/redis/connection-pools",
LinkType: "documentation",
}},
}
Example 3: Interface Not Supported
return &Error{
Code: ErrorCode_ERROR_CODE_INTERFACE_NOT_SUPPORTED,
Message: "TTL operations not supported by PostgreSQL backend",
RequestId: requestID,
Category: ErrorCategory_ERROR_CATEGORY_CLIENT_ERROR,
Severity: ErrorSeverity_ERROR_SEVERITY_ERROR,
Source: "postgres-plugin",
Namespace: req.Namespace,
RetryPolicy: &RetryPolicy{
Retryable: false,
BackoffStrategy: BackoffStrategy_BACKOFF_STRATEGY_NEVER,
RetryAdvice: "Use Redis or DynamoDB for TTL support",
},
Details: []*ErrorDetail{{
Detail: &ErrorDetail_PatternError{
PatternError: &PatternError{
PatternType: "keyvalue",
InterfaceName: "KeyValueTTLInterface",
SemanticError: "PostgreSQL does not implement TTL interface",
SupportedOperations: []string{"Set", "Get", "Delete", "Exists", "BatchSet", "BatchGet"},
},
},
}},
HelpLinks: []*ErrorLink{{
Title: "Backend Interface Support Matrix",
Url: "https://docs.prism.io/backends/interface-matrix",
LinkType: "documentation",
}},
}
Example 4: Rate Limit Exceeded
return &Error{
Code: ErrorCode_ERROR_CODE_TOO_MANY_REQUESTS,
Message: "Rate limit exceeded for namespace 'user-profiles'",
RequestId: requestID,
Category: ErrorCategory_ERROR_CATEGORY_RATE_LIMIT_ERROR,
Severity: ErrorSeverity_ERROR_SEVERITY_WARNING,
Timestamp: timestamppb.Now(),
Source: "prism-proxy",
Namespace: "user-profiles",
RetryPolicy: &RetryPolicy{
Retryable: true,
RetryAfter: durationpb.New(60 * time.Second),
MaxRetries: 10,
BackoffStrategy: BackoffStrategy_BACKOFF_STRATEGY_LINEAR,
RetryAdvice: "Wait 60 seconds for quota reset",
},
Details: []*ErrorDetail{{
Detail: &ErrorDetail_QuotaViolation{
QuotaViolation: &QuotaViolation{
Dimension: "requests_per_second",
Current: 1500,
Limit: 1000,
ResetTime: timestamppb.New(time.Now().Add(60 * time.Second)),
},
},
}},
HelpLinks: []*ErrorLink{{
Title: "Rate Limiting Policies",
Url: "https://docs.prism.io/quotas-and-limits",
LinkType: "documentation",
}},
}
Error Handling Patterns
Pattern 1: Client Retry Logic
func retryWithBackoff(req *Request, maxRetries int) (*Response, error) {
var lastErr *Error
for attempt := 0; attempt <= maxRetries; attempt++ {
resp, err := client.Call(req)
if err == nil {
return resp, nil
}
// Extract Prism error
lastErr = extractPrismError(err)
// Check if retryable
if lastErr.RetryPolicy == nil || !lastErr.RetryPolicy.Retryable {
return nil, fmt.Errorf("non-retryable error: %s", lastErr.Message)
}
// Respect max retries from server
if lastErr.RetryPolicy.MaxRetries > 0 && attempt >= int(lastErr.RetryPolicy.MaxRetries) {
break
}
// Calculate backoff delay
delay := calculateBackoff(lastErr.RetryPolicy, attempt)
log.Warn("Retrying after error",
"attempt", attempt,
"error", lastErr.Message,
"delay", delay)
time.Sleep(delay)
}
return nil, fmt.Errorf("max retries exceeded: %s", lastErr.Message)
}
func calculateBackoff(policy *RetryPolicy, attempt int) time.Duration {
baseDelay := policy.RetryAfter.AsDuration()
switch policy.BackoffStrategy {
case BackoffStrategy_BACKOFF_STRATEGY_IMMEDIATE:
return 0
case BackoffStrategy_BACKOFF_STRATEGY_LINEAR:
return baseDelay * time.Duration(attempt+1)
case BackoffStrategy_BACKOFF_STRATEGY_EXPONENTIAL:
return baseDelay * time.Duration(math.Pow(policy.BackoffMultiplier, float64(attempt)))
case BackoffStrategy_BACKOFF_STRATEGY_JITTER:
exp := baseDelay * time.Duration(math.Pow(policy.BackoffMultiplier, float64(attempt)))
jitter := time.Duration(rand.Int63n(int64(exp / 2)))
return exp + jitter
default:
return baseDelay
}
}
Pattern 2: Structured Logging
func logError(err *Error) {
logger.Error("Request failed",
"code", err.Code.String(),
"category", err.Category.String(),
"severity", err.Severity.String(),
"message", err.Message,
"request_id", err.RequestId,
"source", err.Source,
"namespace", err.Namespace,
"timestamp", err.Timestamp.AsTime(),
"retryable", err.RetryPolicy != nil && err.RetryPolicy.Retryable,
)
// Log backend-specific details
for _, detail := range err.Details {
if backendErr := detail.GetBackendError(); backendErr != nil {
logger.Error("Backend error details",
"backend", backendErr.BackendType,
"instance", backendErr.BackendInstance,
"backend_code", backendErr.BackendErrorCode,
"operation", backendErr.Operation,
)
}
}
// Log cause chain
for i, cause := range err.Causes {
logger.Error("Error cause",
"depth", i+1,
"message", cause.Message,
"source", cause.Source,
)
}
}
Pattern 3: Prometheus Metrics
var (
errorCounter = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "prism_errors_total",
Help: "Total number of errors by category, code, and backend",
},
[]string{"category", "code", "backend", "namespace"},
)
errorSeverity = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "prism_errors_by_severity",
Help: "Errors by severity level",
},
[]string{"severity", "namespace"},
)
)
func recordError(err *Error) {
backend := extractBackendType(err)
errorCounter.WithLabelValues(
err.Category.String(),
strconv.Itoa(int(err.Code)),
backend,
err.Namespace,
).Inc()
errorSeverity.WithLabelValues(
err.Severity.String(),
err.Namespace,
).Inc()
}
Best Practices Summary
DO:
✅ Use structured error details (ErrorDetail
oneof)
✅ Provide retry guidance (RetryPolicy
)
✅ Chain errors across services (causes
)
✅ Include correlation IDs (request_id
)
✅ Add help links for common errors
✅ Set appropriate severity levels
✅ Populate backend context (BackendError
)
✅ Use semantic error codes (not just 500)
✅ Sanitize sensitive data from error messages
✅ Log errors with structured fields
DON'T:
❌ Return generic "Internal error" without context
❌ Expose internal implementation details to clients
❌ Use string error codes (use enums)
❌ Forget to set category
and severity
❌ Include stack traces in production responses
❌ Make all errors retryable (guide clients)
❌ Leak backend credentials or internal IPs
❌ Use HTTP status codes incorrectly
❌ Ignore error cause chains
❌ Skip setting namespace
for multi-tenant errors
Related Documents
- RFC-001: Prism Architecture - Overall architecture
- MEMO-006: Backend Interface Decomposition - Interface design
- Google API Error Model - Industry best practices
- gRPC Error Handling - gRPC status codes
Revision History
- 2025-10-10: Initial draft with comprehensive error proto design