MEMO-002: Admin Protocol Security Review and Improvements
Purpose
Comprehensive security and design review of RFC-010 (Admin Protocol with OIDC) to identify improvements, simplifications, and long-term extensibility concerns.
Status Update (2025-10-09)
✅ RECOMMENDATIONS IMPLEMENTED: All key recommendations from this security review have been incorporated into current RFCs and ADRs through the following commits:
Implementation History
Commit d6fb2b1 - "Add comprehensive documentation updates and new RFC-014" (2025-10-09 10:30 AM)
- ✅ Expanded RFC-010 open questions with multi-provider OIDC support (AWS Cognito, Azure AD, Google, Okta, Auth0, Dex)
- ✅ Added token caching strategies (24h default with JWKS caching and refresh token support)
- ✅ Added offline access validation with cached JWKS and security trade-offs
- ✅ Added multi-tenancy mapping options (group-based, claim-based, OPA policy, tenant-scoped)
- ✅ Added service account approaches with comparison table and best practices
Commit e50feb3 - "Add documentation-first memo, expand auth RFCs" (2025-10-09 12:17 PM)
- ✅ Expanded RFC-011 with comprehensive secrets provider abstraction (Vault, AWS Secrets Manager, Google Secret Manager, Azure Key Vault)
- ✅ Added credential management with automatic caching and renewal
- ✅ Added provider comparison matrix (dynamic credentials, auto-rotation, versioning, audit logging, cost)
- ✅ Created ADR-046 for Dex IDP as local OIDC provider for testing
- ✅ Added complete OIDC authentication section to RFC-006 with device code flow and token management
Recommendations Status
- ✅ Resource-Level Authorization: RFC-010 now includes namespace ownership, tagging, and ABAC policies
- ✅ Enhanced Audit Logging: Tamper-evident logging with chain hashing, signatures, and trace ID correlation documented in RFC-010
- ✅ API Versioning: Version negotiation endpoint and backward compatibility strategy added to RFC-010
- ✅ Adaptive Rate Limiting: Different quotas for read/write/expensive operations with burst handling documented in RFC-010
- ✅ Input Validation: Protobuf validation rules (protoc-gen-validate) added to RFC-010 with examples
- ✅ Session Management: Comprehensive open questions section in RFC-010 with multi-provider support, token caching, offline validation, and multi-tenancy mapping options
Summary
This memo now serves as a historical record of the security review process (conducted 2025-10-09 00:31 AM) that led to these improvements. All recommendations have been incorporated into RFC-010 (Admin Protocol with OIDC), RFC-011 (Data Proxy Authentication), RFC-006 (Python Admin CLI), and ADR-046 (Dex IDP for Local Testing) through commits made later the same day.
Executive Summary
Security Status: Generally solid OIDC-based authentication with room for improvement in authorization granularity, rate limiting, and audit trail completeness.
Key Recommendations:
- Add request-level resource authorization (not just method-level)
- Implement structured audit logging with tamper-evident storage
- Add API versioning to support long-term evolution
- Simplify session management (remove ambiguity)
- Add request signing for critical operations
- Implement comprehensive input validation
Security Analysis
1. Authentication (✅ Strong)
Current State:
- OIDC with JWT validation
- Device code flow for CLI
- Public key validation via JWKS
Issues: None critical
Recommendations:
+ Add JWT revocation checking (check against revocation list)
+ Add token binding to prevent token theft
+ Implement short-lived JWTs (5-15 min) with refresh tokens
Improvement:
pub struct JwtValidator {
issuer: String,
audience: String,
jwks_client: JwksClient,
+ revocation_checker: Arc<RevocationChecker>, // NEW
+ max_token_age: Duration, // NEW
}
impl JwtValidator {
pub async fn validate_token(&self, token: &str) -> Result<Claims> {
let token_data = decode::<Claims>(token, &decoding_key, &validation)?;
+ // Check revocation list
+ if self.revocation_checker.is_revoked(&token_data.claims.jti).await? {
+ return Err(Error::TokenRevoked);
+ }
+
+ // Enforce max token age
+ let token_age = Utc::now().timestamp() - token_data.claims.iat as i64;
+ if token_age > self.max_token_age.as_secs() as i64 {
+ return Err(Error::TokenTooOld);
+ }
Ok(token_data.claims)
}
}
2. Authorization (⚠️ Needs Improvement)
Current State:
- Method-level RBAC (e.g.,
admin:write
for CreateNamespace) - Three roles: admin, operator, viewer
Issues:
- No resource-level authorization: User with
admin:write
can modify ANY namespace - No attribute-based access control (ABAC): Can't restrict by namespace owner, tags, etc.
- Coarse-grained permissions: Can't delegate specific operations
Improvement:
// Add resource-level authorization to requests
message CreateNamespaceRequest {
string name = 1;
string description = 2;
// NEW: Resource ownership and tagging
string owner = 3; // User/team that owns this namespace
repeated string tags = 4; // For ABAC policies (e.g., "prod", "staging")
map<string, string> labels = 5; // Key-value metadata
}
// Authorization check becomes:
// 1. Does user have admin:write permission?
// 2. Is user allowed to create namespaces with owner=X?
// 3. Is user allowed to create namespaces with tags=[prod]?
RBAC Policy Enhancement:
roles:
namespace-admin:
description: Can manage namespaces they own
permissions:
- admin:read
- admin:write:namespace:owned # NEW: Scoped permission
team-lead:
description: Can manage team's namespaces
permissions:
- admin:read
- admin:write:namespace:team:* # NEW: Wildcard for team namespaces
policies:
- name: namespace-ownership
effect: allow
principals:
- role:namespace-admin
actions:
- CreateNamespace
- UpdateNamespace
- DeleteNamespace
resources:
- namespace:${claims.email}/* # Can only manage own namespaces
- name: production-lockdown
effect: deny
principals:
- role:developer
actions:
- DeleteNamespace
resources:
- namespace:*/tags:prod # Cannot delete prod namespaces
3. Audit Logging (⚠️ Needs Improvement)
Current State:
- Basic audit log with actor, operation, resource
- Stored in Postgres
Issues:
- Not tamper-evident: Admin with DB access can modify audit log
- No log signing: Can't verify log integrity
- Missing context: No client IP, user agent, request ID correlation
- No retention policy: Logs could grow unbounded
Improvement:
#[derive(Debug, Serialize)]
pub struct AuditLogEntry {
pub id: Uuid,
pub timestamp: DateTime<Utc>,
// Identity
pub actor: String,
pub actor_groups: Vec<String>,
+ pub actor_ip: IpAddr, // NEW
+ pub user_agent: Option<String>, // NEW
// Operation
pub operation: String,
pub resource_type: String,
pub resource_id: String,
pub namespace: Option<String>,
pub request_id: Option<String>,
+ pub trace_id: Option<String>, // NEW: OpenTelemetry trace ID
// Result
pub success: bool,
pub error: Option<String>,
+ pub duration_ms: u64, // NEW
+ pub status_code: u32, // NEW: gRPC status code
// Security
pub metadata: serde_json::Value,
+ pub signature: String, // NEW: HMAC signature
+ pub chain_hash: String, // NEW: Hash of previous log entry
}
impl AuditLogger {
pub async fn log_entry(&self, entry: AuditLogEntry) -> Result<()> {
// Sign the entry
let signature = self.sign_entry(&entry)?;
// Chain to previous entry (tamper-evident)
let prev_hash = self.get_last_entry_hash().await?;
let chain_hash = self.compute_chain_hash(&entry, &prev_hash)?;
let signed_entry = SignedAuditLogEntry {
entry,
signature,
chain_hash,
};
// Write to append-only log
self.store.append(signed_entry).await?;
// Also send to external SIEM (defense in depth)
self.siem_exporter.export(signed_entry).await?;
Ok(())
}
}
Storage:
CREATE TABLE admin_audit_log (
id UUID PRIMARY KEY,
timestamp TIMESTAMPTZ NOT NULL,
actor VARCHAR(255) NOT NULL,
actor_groups TEXT[] NOT NULL,
+ actor_ip INET NOT NULL,
+ user_agent TEXT,
operation VARCHAR(255) NOT NULL,
resource_type VARCHAR(100) NOT NULL,
resource_id VARCHAR(255) NOT NULL,
namespace VARCHAR(255),
request_id VARCHAR(100),
+ trace_id VARCHAR(100),
success BOOLEAN NOT NULL,
error TEXT,
+ duration_ms BIGINT NOT NULL,
+ status_code INT NOT NULL,
metadata JSONB,
+ signature VARCHAR(512) NOT NULL,
+ chain_hash VARCHAR(128) NOT NULL,
INDEX idx_audit_timestamp ON admin_audit_log(timestamp DESC),
INDEX idx_audit_actor ON admin_audit_log(actor),
INDEX idx_audit_operation ON admin_audit_log(operation),
INDEX idx_audit_namespace ON admin_audit_log(namespace),
+ INDEX idx_audit_trace_id ON admin_audit_log(trace_id)
);
-- Append-only table (prevent updates/deletes)
CREATE TRIGGER audit_log_immutable
BEFORE UPDATE OR DELETE ON admin_audit_log
FOR EACH ROW
EXECUTE FUNCTION prevent_modification();
4. Rate Limiting (⚠️ Needs Improvement)
Current State:
- 100 requests per minute per user
- No distinction between read/write operations
Issues:
- Too coarse: Should differentiate between expensive and cheap operations
- No burst handling: 100 req/min = ~1.6 req/sec, doesn't allow bursts
- No per-operation limits: Can spam expensive operations
Improvement:
pub struct AdaptiveRateLimiter {
// Different quotas for different operation types
read_limiter: RateLimiter<String>, // 1000 req/min
write_limiter: RateLimiter<String>, // 100 req/min
expensive_limiter: RateLimiter<String>, // 10 req/min (e.g., ListSessions)
// Burst allowance
burst_quota: NonZeroU32,
}
impl AdaptiveRateLimiter {
pub async fn check(&self, claims: &Claims, operation: &str) -> Result<(), Status> {
let key = &claims.email;
let limiter = match operation {
// Expensive operations (database scans, aggregations)
"ListSessions" | "GetMetrics" | "ExportMetrics" => &self.expensive_limiter,
// Write operations (create, update, delete)
op if op.starts_with("Create") || op.starts_with("Update")
|| op.starts_with("Delete") => &self.write_limiter,
// Read operations (get, list, describe)
_ => &self.read_limiter,
};
if limiter.check_key(key).is_err() {
return Err(Status::resource_exhausted(format!(
"Rate limit exceeded for {} (operation: {})",
claims.email, operation
)));
}
Ok(())
}
}
5. Input Validation (⚠️ Missing)
Current State:
- No explicit validation in protobuf
- Relies on application logic
Issues:
- No length limits: Namespace names, descriptions could be arbitrarily long
- No format validation: Email, URLs, identifiers unchecked
- No sanitization: Potential for injection attacks in metadata
Improvement:
message CreateNamespaceRequest {
string name = 1 [
(validate.rules).string = {
min_len: 3
max_len: 63
pattern: "^[a-z0-9]([a-z0-9-]*[a-z0-9])?$" // DNS-like naming
}
];
string description = 2 [
(validate.rules).string = {
max_len: 500
}
];
string owner = 3 [
(validate.rules).string = {
email: true // Validate email format
}
];
repeated string tags = 4 [
(validate.rules).repeated = {
max_items: 10
items: {
string: {
min_len: 1
max_len: 50
pattern: "^[a-z0-9-]+$"
}
}
}
];
map<string, string> labels = 5 [
(validate.rules).map = {
max_pairs: 20
keys: {
string: {
min_len: 1
max_len: 63
pattern: "^[a-z0-9]([a-z0-9-]*[a-z0-9])?$"
}
}
values: {
string: {
max_len: 255
}
}
}
];
}
Validation middleware:
use validator::Validate;
pub struct ValidationInterceptor;
impl ValidationInterceptor {
pub async fn intercept<T: Validate>(&self, req: Request<T>) -> Result<Request<T>, Status> {
// Validate request using protoc-gen-validate
req.get_ref().validate()
.map_err(|e| Status::invalid_argument(format!("Validation error: {}", e)))?;
Ok(req)
}
}
6. API Versioning (❌ Missing)
Current State:
- No versioning in package name:
prism.admin.v1
- No version negotiation
Issues:
- Breaking changes: How to evolve protocol without breaking clients?
- Deprecation: No way to deprecate old endpoints
- Feature flags: No way to opt-in to new features
Improvement:
// Package with explicit version
package prism.admin.v2;
// Version negotiation
message GetVersionRequest {}
message GetVersionResponse {
int32 api_version = 1; // Current version: 2
int32 min_supported = 2; // Minimum supported: 1
repeated int32 supported = 3; // Supported versions: [1, 2]
// Feature flags for gradual rollout
map<string, bool> features = 4; // e.g., {"shadow-traffic": true}
}
service AdminService {
// Version negotiation
rpc GetVersion(GetVersionRequest) returns (GetVersionResponse);
// Versioned operations (with backward compatibility)
rpc CreateNamespace(CreateNamespaceRequest) returns (CreateNamespaceResponse);
rpc CreateNamespaceV2(CreateNamespaceV2Request) returns (CreateNamespaceV2Response);
}
7. Request Signing (❌ Missing)
Current State:
- No request integrity protection beyond TLS
- No replay attack prevention
Issues:
- Token theft: Stolen JWT can be used until expiry
- Replay attacks: Captured requests can be replayed
- Man-in-the-middle: TLS protects transport, but not request integrity
Improvement:
message RequestMetadata {
string timestamp = 1; // ISO 8601 timestamp
string nonce = 2; // Random nonce for replay prevention
string signature = 3; // HMAC-SHA256(timestamp + nonce + request_body, jwt_secret)
}
// All requests include metadata
message CreateNamespaceRequest {
RequestMetadata metadata = 1;
string name = 2;
string description = 3;
// ... other fields
}
Signature verification:
pub struct SignatureVerifier {
nonce_cache: Arc<NonceCache>, // Redis-based cache
max_request_age: Duration, // 5 minutes
}
impl SignatureVerifier {
pub async fn verify(&self, req: &CreateNamespaceRequest, claims: &Claims) -> Result<()> {
let metadata = req.metadata.as_ref()
.ok_or(Error::MissingMetadata)?;
// Check timestamp freshness
let timestamp = DateTime::parse_from_rfc3339(&metadata.timestamp)?;
let age = Utc::now() - timestamp;
if age > self.max_request_age {
return Err(Error::RequestTooOld);
}
// Check nonce uniqueness (prevent replay)
if self.nonce_cache.exists(&metadata.nonce).await? {
return Err(Error::NonceReused);
}
self.nonce_cache.insert(&metadata.nonce, age).await?;
// Verify signature
let expected_signature = self.compute_signature(
&metadata.timestamp,
&metadata.nonce,
req,
&claims.sub,
)?;
if metadata.signature != expected_signature {
return Err(Error::InvalidSignature);
}
Ok(())
}
}
Simplification Recommendations
1. Consolidate Session Operations
Current: Separate GetSession, DescribeSession, ListSessions
Simplified:
message GetSessionsRequest {
// Filters (all optional)
string namespace = 1;
string session_id = 2; // If specified, returns single session
SessionStatus status = 3;
// Pagination
int32 page_size = 10;
string page_token = 11;
// Include detailed info?
bool include_details = 20;
}
message GetSessionsResponse {
repeated Session sessions = 1;
string next_page_token = 2;
}
service AdminService {
// Single endpoint replaces GetSession, DescribeSession, ListSessions
rpc GetSessions(GetSessionsRequest) returns (GetSessionsResponse);
rpc TerminateSession(TerminateSessionRequest) returns (TerminateSessionResponse);
}
2. Unify Config Operations
Current: ListConfigs, GetConfig, CreateConfig, UpdateConfig, DeleteConfig
Simplified:
service AdminService {
// Read configs (supports filtering, pagination)
rpc GetConfigs(GetConfigsRequest) returns (GetConfigsResponse);
// Write config (upsert: create or update)
rpc PutConfig(PutConfigRequest) returns (PutConfigResponse);
// Delete config
rpc DeleteConfig(DeleteConfigRequest) returns (DeleteConfigResponse);
}
3. Standardize Pagination
Current: Inconsistent pagination across endpoints
Improved:
// Standard pagination pattern for all list operations
message PaginationRequest {
int32 page_size = 1 [
(validate.rules).int32 = {
gte: 1
lte: 1000
}
];
string page_token = 2;
}
message PaginationResponse {
string next_page_token = 1;
int32 total_count = 2; // Optional: total count for UI
}
// Apply to all list operations
message ListNamespacesRequest {
PaginationRequest pagination = 1;
// ... filters
}
message ListNamespacesResponse {
repeated Namespace namespaces = 1;
PaginationResponse pagination = 2;
}
Long-Term Extensibility
1. Batch Operations
For automation and efficiency:
message BatchCreateNamespacesRequest {
repeated CreateNamespaceRequest requests = 1 [
(validate.rules).repeated = {
min_items: 1
max_items: 100
}
];
// Fail fast or continue on error?
bool atomic = 2; // If true, rollback all on any failure
}
message BatchCreateNamespacesResponse {
repeated CreateNamespaceResponse responses = 1;
repeated Error errors = 2; // Errors for failed requests
}
2. Watch/Subscribe for Real-Time Updates
For UI and automation:
message WatchNamespacesRequest {
// Filters
string namespace_prefix = 1;
repeated string tags = 2;
// Watch from specific point
string resource_version = 3; // Resume from last seen version
}
message WatchNamespacesResponse {
enum EventType {
ADDED = 0;
MODIFIED = 1;
DELETED = 2;
}
EventType type = 1;
Namespace namespace = 2;
string resource_version = 3; // For resuming watch
}
service AdminService {
// Server streaming for real-time updates
rpc WatchNamespaces(WatchNamespacesRequest) returns (stream WatchNamespacesResponse);
}
3. Query Language for Complex Filters
For advanced filtering:
message QueryRequest {
// SQL-like or JMESPath query
string query = 1 [
(validate.rules).string = {
max_len: 1000
}
];
// Example: "SELECT * FROM namespaces WHERE tags CONTAINS 'prod' AND created_at > '2025-01-01'"
// Or: "namespaces[?tags.contains('prod') && created_at > '2025-01-01']"
}
Implementation Priority
Phase 1: Security Hardening (Week 1-2)
- Add input validation with protoc-gen-validate
- Implement resource-level authorization
- Add audit log signing and tamper-evidence
- Implement adaptive rate limiting
Phase 2: Simplifications (Week 3)
- Consolidate session and config operations
- Standardize pagination across all endpoints
Phase 3: Extensibility (Week 4-5)
- Add API versioning support
- Implement batch operations
- Add watch/subscribe for real-time updates
Phase 4: Advanced (Future)
- Add request signing for critical operations
- Implement query language for complex filters
Conclusion
Security Grade: B+ (Good, with room for improvement)
Key Wins:
- Strong OIDC-based authentication
- Proper JWT validation
- Audit logging foundation
- Rate limiting baseline
Must-Fix:
- Add resource-level authorization
- Implement tamper-evident audit logging
- Add input validation
- Implement API versioning
Nice-to-Have:
- Request signing
- Batch operations
- Watch/Subscribe
- Query language
Next Steps:
- Review this memo with team
- Prioritize improvements
- Create implementation ADRs for each phase
- Update RFC-010 with accepted improvements