MEMO-002: Admin Protocol Security Review and Improvements
Purpose
Comprehensive security and design review of RFC-010 (Admin Protocol with OIDC) to identify improvements, simplifications, and long-term extensibility concerns.
Status Update (2025-10-09)
✅ RECOMMENDATIONS IMPLEMENTED: All key recommendations from this security review have been incorporated into current RFCs and ADRs through the following commits:
Implementation History
Commit d6fb2b1 - "Add comprehensive documentation updates and new RFC-014" (2025-10-09 10:30 AM)
- ✅ Expanded RFC-010 open questions with multi-provider OIDC support (AWS Cognito, Azure AD, Google, Okta, Auth0, Dex)
- ✅ Added token caching strategies (24h default with JWKS caching and refresh token support)
- ✅ Added offline access validation with cached JWKS and security trade-offs
- ✅ Added multi-tenancy mapping options (group-based, claim-based, OPA policy, tenant-scoped)
- ✅ Added service account approaches with comparison table and best practices
Commit e50feb3 - "Add documentation-first memo, expand auth RFCs" (2025-10-09 12:17 PM)
- ✅ Expanded RFC-011 with comprehensive secrets provider abstraction (Vault, AWS Secrets Manager, Google Secret Manager, Azure Key Vault)
- ✅ Added credential management with automatic caching and renewal
- ✅ Added provider comparison matrix (dynamic credentials, auto-rotation, versioning, audit logging, cost)
- ✅ Created ADR-046 for Dex IDP as local OIDC provider for testing
- ✅ Added complete OIDC authentication section to RFC-006 with device code flow and token management
Recommendations Status
- ✅ Resource-Level Authorization: RFC-010 now includes namespace ownership, tagging, and ABAC policies
- ✅ Enhanced Audit Logging: Tamper-evident logging with chain hashing, signatures, and trace ID correlation documented in RFC-010
- ✅ API Versioning: Version negotiation endpoint and backward compatibility strategy added to RFC-010
- ✅ Adaptive Rate Limiting: Different quotas for read/write/expensive operations with burst handling documented in RFC-010
- ✅ Input Validation: Protobuf validation rules (protoc-gen-validate) added to RFC-010 with examples
- ✅ Session Management: Comprehensive open questions section in RFC-010 with multi-provider support, token caching, offline validation, and multi-tenancy mapping options
Summary
This memo now serves as a historical record of the security review process (conducted 2025-10-09 12:31 AM) that led to these improvements. All recommendations have been incorporated into RFC-010 (Admin Protocol with OIDC), RFC-011 (Data Proxy Authentication), RFC-006 (Python Admin CLI), and ADR-046 (Dex IDP for Local Testing) through commits made later the same day.
Executive Summary
Security Status: Generally solid OIDC-based authentication with room for improvement in authorization granularity, rate limiting, and audit trail completeness.
Key Recommendations:
- Add request-level resource authorization (not just method-level)
- Implement structured audit logging with tamper-evident storage
- Add API versioning to support long-term evolution
- Simplify session management (remove ambiguity)
- Add request signing for critical operations
- Implement comprehensive input validation
Security Analysis
1. Authentication (✅ Strong)
Current State:
- OIDC with JWT validation
- Device code flow for CLI
- Public key validation via JWKS
Issues: None critical
Recommendations:
+ Add JWT revocation checking (check against revocation list)
+ Add token binding to prevent token theft
+ Implement short-lived JWTs (5-15 min) with refresh tokens
Improvement:
pub struct JwtValidator {
issuer: String,
audience: String,
jwks_client: JwksClient,
+ revocation_checker: Arc<RevocationChecker>, // NEW
+ max_token_age: Duration, // NEW
}
impl JwtValidator {
pub async fn validate_token(&self, token: &str) -> Result<Claims> {
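// `decoding_key` and `validation` are assumed to be built from the JWKS lookup and the issuer/audience settings (setup elided)
let token_data = decode::<Claims>(token, &decoding_key, &validation)?;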
+ // Check revocation list
+ if self.revocation_checker.is_revoked(&token_data.claims.jti).await? {
+ return Err(Error::TokenRevoked);
+ }
+
+ // Enforce max token age
+ let token_age = Utc::now().timestamp() - token_data.claims.iat as i64;
+ if token_age > self.max_token_age.as_secs() as i64 {
+ return Err(Error::TokenTooOld);
+ }
Ok(token_data.claims)
}
}
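The two added checks can be exercised without any JWT library. The following is a minimal sketch under stated assumptions: `Claims`, `TokenError`, `RevocationList`, and `check_token` are hypothetical names, the revocation list is a plain in-memory set, and the current time is passed in explicitly so the logic stays deterministic. It is not the RFC-010 implementation.

```rust
use std::collections::HashSet;
use std::time::Duration;

/// Hypothetical subset of the JWT claims needed for these checks.
#[derive(Debug, Clone)]
pub struct Claims {
    pub jti: String, // unique token ID
    pub iat: u64,    // issued-at, seconds since the Unix epoch
}

#[derive(Debug, PartialEq)]
pub enum TokenError {
    Revoked,
    TooOld,
    IssuedInFuture,
}

/// In-memory revocation list keyed by `jti`.
/// Assumption: a real deployment would back this with a shared store.
#[derive(Default)]
pub struct RevocationList {
    revoked: HashSet<String>,
}

impl RevocationList {
    pub fn revoke(&mut self, jti: &str) {
        self.revoked.insert(jti.to_string());
    }

    pub fn is_revoked(&self, jti: &str) -> bool {
        self.revoked.contains(jti)
    }
}

/// Runs after signature validation: rejects revoked tokens and tokens
/// older than `max_age`. `now` is seconds since the Unix epoch.
pub fn check_token(
    claims: &Claims,
    revocations: &RevocationList,
    max_age: Duration,
    now: u64,
) -> Result<(), TokenError> {
    if revocations.is_revoked(&claims.jti) {
        return Err(TokenError::Revoked);
    }
    // checked_sub fails when `iat` lies in the future, which would
    // otherwise show up as a negative (always acceptable) age.
    let age = now.checked_sub(claims.iat).ok_or(TokenError::IssuedInFuture)?;
    if age > max_age.as_secs() {
        return Err(TokenError::TooOld);
    }
    Ok(())
}
```

Rejecting a future `iat` is an addition in this sketch; the snippet above would treat such a token as having a negative age and accept it.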
2. Authorization (⚠️ Needs Improvement)
Current State:
- Method-level RBAC (e.g., admin:write for CreateNamespace)
- Three roles: admin, operator, viewer
Issues:
- No resource-level authorization: User with admin:write can modify ANY namespace
- No attribute-based access control (ABAC): Can't restrict by namespace owner, tags, etc.
- Coarse-grained permissions: Can't delegate specific operations
Improvement:
// Add resource-level authorization to requests
message CreateNamespaceRequest {
string name = 1;
string description = 2;
// NEW: Resource ownership and tagging
string owner = 3; // User/team that owns this namespace
repeated string tags = 4; // For ABAC policies (e.g., "prod", "staging")
map<string, string> labels = 5; // Key-value metadata
}
// Authorization check becomes:
// 1. Does user have admin:write permission?
// 2. Is user allowed to create namespaces with owner=X?
// 3. Is user allowed to create namespaces with tags=[prod]?
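The three-step check listed in the comments could look roughly like the sketch below. All names (`Principal`, `NamespaceRequest`, `Denied`, `authorize_create`) are hypothetical, and the assumption that a user may always name themselves as owner is an illustration, not something RFC-010 specifies.

```rust
use std::collections::HashSet;

/// Hypothetical view of the authenticated caller.
pub struct Principal {
    pub id: String,
    pub permissions: HashSet<String>,
    pub allowed_owners: HashSet<String>, // owners (teams) the caller may act for
    pub allowed_tags: HashSet<String>,   // tags the caller may apply
}

/// Hypothetical mirror of the owner/tags fields in CreateNamespaceRequest.
pub struct NamespaceRequest {
    pub owner: String,
    pub tags: Vec<String>,
}

#[derive(Debug, PartialEq)]
pub enum Denied {
    MissingPermission,
    OwnerNotAllowed,
    TagNotAllowed(String),
}

pub fn authorize_create(p: &Principal, req: &NamespaceRequest) -> Result<(), Denied> {
    // 1. Method-level permission.
    if !p.permissions.contains("admin:write") {
        return Err(Denied::MissingPermission);
    }
    // 2. Ownership: the caller may own the namespace directly or act
    //    for an owner they are explicitly allowed to represent.
    if req.owner != p.id && !p.allowed_owners.contains(&req.owner) {
        return Err(Denied::OwnerNotAllowed);
    }
    // 3. Tags: every requested tag must be allowed for the caller.
    if let Some(tag) = req.tags.iter().find(|t| !p.allowed_tags.contains(*t)) {
        return Err(Denied::TagNotAllowed(tag.clone()));
    }
    Ok(())
}
```

Returning a distinct reason per step keeps the denial explainable in the audit log, at the cost of telling the caller which check failed.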
RBAC Policy Enhancement:
roles:
  namespace-admin:
    description: Can manage namespaces they own
    permissions:
      - admin:read
      - admin:write:namespace:owned  # NEW: Scoped permission
  team-lead:
    description: Can manage team's namespaces
    permissions:
      - admin:read
      - admin:write:namespace:team:*  # NEW: Wildcard for team namespaces
policies:
  - name: namespace-ownership
    effect: allow
    principals:
      - role:namespace-admin
    actions:
      - CreateNamespace
      - UpdateNamespace
      - DeleteNamespace
    resources:
      - namespace:${claims.email}/*  # Can only manage own namespaces
  - name: production-lockdown
    effect: deny
    principals:
      - role:developer
    actions:
      - DeleteNamespace
    resources:
      - namespace:*/tags:prod  # Cannot delete prod namespaces
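One possible way to evaluate such policies is sketched below. It assumes deny-overrides semantics (any matching deny wins, and the default is deny), treats resource strings as opaque text where `*` matches any run of characters, and assumes `${claims.email}` has already been substituted. `Policy`, `Effect`, `glob_match`, and `evaluate` are hypothetical names; the memo does not define the evaluation order.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Effect {
    Allow,
    Deny,
}

/// Hypothetical in-memory form of one entry under `policies:`.
pub struct Policy {
    pub effect: Effect,
    pub roles: Vec<String>,
    pub actions: Vec<String>,
    pub resources: Vec<String>, // patterns, `*` matches any run of characters
}

/// Matches `text` against `pattern`, where `*` is the only wildcard.
pub fn glob_match(pattern: &str, text: &str) -> bool {
    let parts: Vec<&str> = pattern.split('*').collect();
    if parts.len() == 1 {
        return pattern == text; // no wildcard: exact match only
    }
    let first = parts[0];
    let last = parts[parts.len() - 1];
    if text.len() < first.len() + last.len()
        || !text.starts_with(first)
        || !text.ends_with(last)
    {
        return false;
    }
    // Literal segments between wildcards must appear in order.
    let mut rest = &text[first.len()..text.len() - last.len()];
    for mid in &parts[1..parts.len() - 1] {
        match rest.find(*mid) {
            Some(i) => rest = &rest[i + mid.len()..],
            None => return false,
        }
    }
    true
}

/// Deny-overrides evaluation: a matching deny wins immediately,
/// and a request with no matching allow is denied by default.
pub fn evaluate(policies: &[Policy], role: &str, action: &str, resource: &str) -> Effect {
    let mut allowed = false;
    for p in policies {
        let applies = p.roles.iter().any(|r| r == role)
            && p.actions.iter().any(|a| a == action)
            && p.resources.iter().any(|pat| glob_match(pat, resource));
        if !applies {
            continue;
        }
        match p.effect {
            Effect::Deny => return Effect::Deny,
            Effect::Allow => allowed = true,
        }
    }
    if allowed { Effect::Allow } else { Effect::Deny }
}
```

With deny-overrides, the order of policies in the file does not affect the outcome, which makes a lockdown rule such as production-lockdown hard to bypass by adding a broader allow.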
3. Audit Logging (⚠️ Needs Improvement)
Current State:
- Basic audit log with actor, operation, resource
- Stored in Postgres
Issues:
- Not tamper-evident: Admin with DB access can modify audit log
- No log signing: Can't verify log integrity
- Missing context: No client IP, user agent, request ID correlation
- No retention policy: Logs could grow unbounded
Improvement:
#[derive(Debug, Serialize)]
pub struct AuditLogEntry {
pub id: Uuid,
pub timestamp: DateTime<Utc>,
// Identity
pub actor: String,
pub actor_groups: Vec<String>,
+ pub actor_ip: IpAddr, // NEW
+ pub user_agent: Option<String>, // NEW
// Operation
pub operation: String,
pub resource_type: String,
pub resource_id: String,
pub namespace: Option<String>,
pub request_id: Option<String>,
+ pub trace_id: Option<String>, // NEW: OpenTelemetry trace ID
// Result
pub success: bool,
pub error: Option<String>,
+ pub duration_ms: u64, // NEW
+ pub status_code: u32, // NEW: gRPC status code
// Security
pub metadata: serde_json::Value,
+ pub signature: String, // NEW: HMAC signature
+ pub chain_hash: String, // NEW: Hash of previous log entry
}
impl AuditLogger {
pub async fn log_entry(&self, entry: AuditLogEntry) -> Result<()> {
// Sign the entry
let signature = self.sign_entry(&entry)?;
// Chain to previous entry (tamper-evident)
let prev_hash = self.get_last_entry_hash().await?;
let chain_hash = self.compute_chain_hash(&entry, &prev_hash)?;
let signed_entry = SignedAuditLogEntry {
entry,
signature,
chain_hash,
};
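// Write to append-only log (clone so the entry can also be exported below;
// assumes SignedAuditLogEntry implements Clone)
self.store.append(signed_entry.clone()).await?;
// Also send to external SIEM (defense in depth)
self.siem_exporter.export(signed_entry).await?;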
Ok(())
}
}
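The chaining step is what makes the log tamper-evident: each entry's hash covers the previous entry's hash, so editing or removing an entry invalidates every later link. The sketch below illustrates the idea with the standard library only. It uses `DefaultHasher` purely as a stand-in; it is not cryptographic, and a real log would use something like SHA-256 plus the HMAC signature described above. `Entry`, `ChainedEntry`, `append`, and `first_invalid` are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical, trimmed-down audit entry.
#[derive(Debug, Clone, Hash)]
pub struct Entry {
    pub actor: String,
    pub operation: String,
    pub resource_id: String,
}

#[derive(Debug, Clone)]
pub struct ChainedEntry {
    pub entry: Entry,
    pub chain_hash: u64,
}

/// Hash value used as the "previous hash" of the first entry.
pub const GENESIS_HASH: u64 = 0;

/// Hash of (previous chain hash, entry contents).
/// NOTE: DefaultHasher is NOT cryptographically secure; it stands in
/// for a real digest so the sketch needs no external crates.
pub fn compute_chain_hash(entry: &Entry, prev_hash: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    prev_hash.hash(&mut hasher);
    entry.hash(&mut hasher);
    hasher.finish()
}

/// Appends an entry, linking it to the current tail of the log.
pub fn append(log: &mut Vec<ChainedEntry>, entry: Entry) {
    let prev = log.last().map_or(GENESIS_HASH, |e| e.chain_hash);
    let chain_hash = compute_chain_hash(&entry, prev);
    log.push(ChainedEntry { entry, chain_hash });
}

/// Walks the chain and returns the index of the first entry whose
/// stored hash does not match the recomputed one, if any.
pub fn first_invalid(log: &[ChainedEntry]) -> Option<usize> {
    let mut prev = GENESIS_HASH;
    for (i, e) in log.iter().enumerate() {
        if compute_chain_hash(&e.entry, prev) != e.chain_hash {
            return Some(i);
        }
        prev = e.chain_hash;
    }
    None
}
```

An unkeyed chain can be recomputed by anyone with write access to the table, which is why the design pairs it with a keyed signature and an external SIEM copy.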
Storage:
CREATE TABLE admin_audit_log (
id UUID PRIMARY KEY,
timestamp TIMESTAMPTZ NOT NULL,
actor VARCHAR(255) NOT NULL,
actor_groups TEXT[] NOT NULL,
+ actor_ip INET NOT NULL,
+ user_agent TEXT,
operation VARCHAR(255) NOT NULL,
resource_type VARCHAR(100) NOT NULL,
resource_id VARCHAR(255) NOT NULL,
namespace VARCHAR(255),
request_id VARCHAR(100),
+ trace_id VARCHAR(100),
success BOOLEAN NOT NULL,
error TEXT,
+ duration_ms BIGINT NOT NULL,
+ status_code INT NOT NULL,
metadata JSONB,
+ signature VARCHAR(512) NOT NULL,
+ chain_hash VARCHAR(128) NOT NULL
);
-- PostgreSQL does not accept inline INDEX clauses in CREATE TABLE, so indexes are created separately
CREATE INDEX idx_audit_timestamp ON admin_audit_log(timestamp DESC);
CREATE INDEX idx_audit_actor ON admin_audit_log(actor);
CREATE INDEX idx_audit_operation ON admin_audit_log(operation);
CREATE INDEX idx_audit_namespace ON admin_audit_log(namespace);
+ CREATE INDEX idx_audit_trace_id ON admin_audit_log(trace_id);
-- Append-only table (prevent updates/deletes); assumes a prevent_modification() trigger function that raises an exception has been defined
CREATE TRIGGER audit_log_immutable
BEFORE UPDATE OR DELETE ON admin_audit_log
FOR EACH ROW
EXECUTE FUNCTION prevent_modification();
4. Rate Limiting (⚠️ Needs Improvement)
Current State:
- 100 requests per minute per user
- No distinction between read/write operations
Issues:
- Too coarse: Should differentiate between expensive and cheap operations
- No burst handling: 100 req/min averages ~1.7 req/sec, with no allowance for short bursts
- No per-operation limits: Can spam expensive operations
Improvement:
pub struct AdaptiveRateLimiter {
// Different quotas for different operation types
read_limiter: RateLimiter<String>, // 1000 req/min
write_limiter: RateLimiter<String>, // 100 req/min
expensive_limiter: RateLimiter<String>, // 10 req/min (e.g., ListSessions)
// Burst allowance
burst_quota: NonZeroU32,
}
impl AdaptiveRateLimiter {
pub async fn check(&self, claims: &Claims, operation: &str) -> Result<(), Status> {
let key = &claims.email;
let limiter = match operation {
// Expensive operations (database scans, aggregations)
"ListSessions" | "GetMetrics" | "ExportMetrics" => &self.expensive_limiter,
// Write operations (create, update, delete)
op if op.starts_with("Create") || op.starts_with("Update")
|| op.starts_with("Delete") => &self.write_limiter,
// Read operations (get, list, describe)
_ => &self.read_limiter,
};
if limiter.check_key(key).is_err() {
return Err(Status::resource_exhausted(format!(
"Rate limit exceeded for {} (operation: {})",
claims.email, operation
)));
}
Ok(())
}
}
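Burst handling is commonly modeled as a token bucket: the bucket holds up to `burst` tokens, refills at the sustained rate, and each request spends one token. The sketch below is a standalone illustration with a caller-supplied clock (seconds as `f64`) so behavior is deterministic; `TokenBucket` and its methods are hypothetical and not the limiter type used in the snippet above.

```rust
/// Hypothetical single-key token bucket.
pub struct TokenBucket {
    capacity: f64,       // maximum burst size, in requests
    refill_per_sec: f64, // sustained rate
    tokens: f64,         // tokens currently available
    last_refill: f64,    // caller-supplied clock, in seconds
}

impl TokenBucket {
    /// Starts full, so a new client can burst immediately.
    pub fn new(per_minute: u32, burst: u32, now: f64) -> Self {
        Self {
            capacity: burst as f64,
            refill_per_sec: per_minute as f64 / 60.0,
            tokens: burst as f64,
            last_refill: now,
        }
    }

    /// Returns true and spends one token if the request is allowed.
    pub fn try_acquire(&mut self, now: f64) -> bool {
        // Refill for the elapsed time, capped at the burst capacity.
        // A clock that moves backwards is treated as zero elapsed time.
        let elapsed = (now - self.last_refill).max(0.0);
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_refill = self.last_refill.max(now);

        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

For example, `TokenBucket::new(100, 20, 0.0)` admits 20 back-to-back requests, then settles to the sustained 100 req/min rate as tokens refill. One bucket per (user, operation class) pair would reproduce the read/write/expensive split shown above.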