Client-Originated Configuration
Context
Traditional data infrastructure requires manual provisioning:
- Application team estimates data requirements
- DBA provisions database cluster
- Application team configures connection details
- Capacity is often wrong (over- or under-provisioned)
- Changes require coordination between teams
Netflix's Data Gateway improves on this with declarative deployment configuration, but still requires infrastructure-team involvement to map capacity requirements to hardware.
Problem: Manual capacity planning is slow, error-prone, and creates bottlenecks.
Decision
Implement client-originated configuration where applications declare their data access patterns in protobuf definitions, and Prism automatically:
- Selects optimal backend storage engine
- Calculates capacity requirements
- Provisions infrastructure
- Configures connections and policies
Rationale
How It Works
Applications define data models with annotations:
```protobuf
message UserEvents {
  string user_id = 1 [(prism.index) = "partition_key"];
  bytes event_data = 2;
  int64 timestamp = 3 [(prism.index) = "clustering_key"];

  option (prism.access_pattern) = "append_heavy";  // 95% writes, 5% reads
  option (prism.estimated_write_rps) = 10000;      // Peak writes/sec
  option (prism.estimated_read_rps) = 500;         // Peak reads/sec
  option (prism.data_size_estimate_mb) = 1000;     // Total data size
  option (prism.retention_days) = 90;              // Auto-delete old data
  option (prism.consistency) = "eventual";         // Consistency requirement
  option (prism.latency_p99_ms) = 10;              // Latency SLO
}
```
Prism's capacity planner:
- Analyzes access pattern: "append_heavy" → Kafka is ideal
- Calculates partition count: 10k writes/sec → 20 partitions (500 writes/partition/sec)
- Provisions cluster: Creates Kafka cluster with appropriate instance types
- Configures retention: Sets 90-day retention policy
- Sets up monitoring: Alerts if P99 > 10ms or RPS exceeds 10k
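To make the arithmetic concrete, here is a minimal sketch of the two sizing heuristics above (partition count and retention storage). The helper names are hypothetical; the 500 writes/sec-per-partition rule of thumb matches the fuller planner sketch under Implementation Notes below.

```rust
/// Rule-of-thumb Kafka partition count (hypothetical helper).
fn kafka_partitions(write_rps: i64) -> i64 {
    // ~500 writes/sec per partition: 10_000 rps -> 20 partitions
    (write_rps / 500).max(1)
}

/// Storage needed to hold the retention window (hypothetical helper).
fn retention_storage_gb(write_rps: i64, avg_msg_bytes: i64, retention_days: i64) -> i64 {
    // Daily volume in MB, then scaled by the retention window
    let daily_mb = write_rps * 86_400 * avg_msg_bytes / 1_000_000;
    daily_mb * retention_days / 1_000
}

fn main() {
    assert_eq!(kafka_partitions(10_000), 20);
    // e.g. 10k writes/sec at an assumed 1 KB/message for 90 days ≈ 78 TB before replication
    println!("{} GB", retention_storage_gb(10_000, 1_000, 90));
}
```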
Benefits Over Manual Provisioning
| Aspect | Manual | Client-Originated |
|---|---|---|
| Time to provision | Days/weeks | Minutes |
| Accuracy | Often wrong | Data-driven |
| Ownership | Split (app + infra teams) | Clear (app team) |
| Scaling | Manual requests | Automatic |
| Cost optimization | Ad-hoc | Continuous |
Alternatives Considered
1. Manual Provisioning (traditional approach)
   - Pros:
     - Full control
     - Familiar to ops teams
   - Cons:
     - Slow (days/weeks)
     - Error-prone
     - Creates bottlenecks
     - Scales poorly (1 DBA : N teams)
   - Rejected because: Doesn't scale as the organization grows
2. Declarative Deployment Config (Netflix's approach)
   - Pros:
     - Better than manual
     - Infrastructure as code
     - Version controlled
   - Cons:
     - Still requires capacity planning expertise
     - Separate from application code
     - Changes require infra team review
   - Rejected because: Still creates coordination overhead
3. Fully Automatic (no application hints)
   - Pros:
     - Zero configuration burden
     - Ultimate simplicity
   - Cons:
     - Cannot optimize for known patterns
     - Over-provisions to be safe
     - Higher costs
   - Rejected because: Loses optimization opportunities
4. Runtime Metrics-Based (scale based on observed load)
   - Pros:
     - Responds to actual usage
     - No estimation needed
   - Cons:
     - Reactive, not proactive
     - Poor for spiky workloads
     - Doesn't help initial provisioning
   - Rejected as the sole approach: combined instead with client-originated config for continuous optimization (see Phase 4 under Evolution Strategy)
Consequences
Positive
- Faster Development: No waiting for database provisioning
- Self-Service: Application teams are empowered
- Accurate Capacity: Based on actual requirements, not guesses
- Cost Optimization: Right-sized infrastructure from day one
- Living Documentation: Protobuf definitions document requirements
- Easier Migrations: Change `option (prism.backend) = "postgres"` to `"kafka"` and redeploy
- Organizational Scalability: Infrastructure team doesn't become a bottleneck as the company grows
Negative
- More Complex Tooling: Capacity planner must be sophisticated
  - Mitigation: Start with conservative heuristics; refine over time
- Protobuf Coupling: Configuration embedded in data models
  - Mitigation: This is intentional; it keeps requirements close to code
- Requires Estimation: Teams must estimate RPS and data size
  - Mitigation: Provide estimation tools; Prism adapts based on actual metrics
- Configuration Authority: Need authorization boundaries to prevent misuse
  - Mitigation: Policy-driven configuration limits (see Organizational Scalability section)
Neutral
- Shifts Responsibility: From infra team to app teams
  - Some teams will prefer this (autonomy)
  - Others may miss having an expert provision for them
  - Plan: Provide templates and examples for common patterns
Organizational Scalability and Authorization Boundaries
The Scalability Challenge
As organizations grow, traditional manual provisioning breaks down:
| Organization Size | Manual Provisioning Model | Bottleneck |
|---|---|---|
| Startup (1-5 teams) | 1 DBA provisions all databases | Works initially |
| Growing (10-20 teams) | 2-3 DBAs, ticket queue | 1-2 week delays |
| Scale (50+ teams) | 5-10 DBAs, complex approval process | 2-4 week delays, team burnout |
| Large (500+ teams) | 20+ DBAs, dedicated infrastructure org | Infrastructure team larger than feature teams |
Client-originated configuration solves this: Infrastructure team size remains constant (maintain Prism platform) while application teams scale linearly.
Key Insight: Client configurability is essential for organizational scalability, but requires authorization boundaries to prevent misuse.
Authorization Boundaries: Expressibility vs Security/Reliability
The Tension: Allow teams enough expressibility to move fast, but prevent configurations that compromise security or reliability.
Guiding Principles:
- Default to Safe: Conservative defaults prevent common misconfigurations
- Progressive Permission: Teams earn more configurability through demonstrated responsibility
- Policy as Code: Configuration limits defined in version-controlled policies
- Fail Loudly: Invalid configurations rejected at deploy-time, not runtime
Configuration Permission Levels
Level 1: Guided (Default for All Teams)
- ✅ Allowed: Choose from pre-approved backends (Postgres, Kafka, Redis)
- ✅ Allowed: Set access patterns (`read_heavy`, `write_heavy`, `balanced`)
- ✅ Allowed: Declare capacity estimates (within reasonable bounds)
- ✅ Allowed: Configure retention (up to organization maximum)
- ❌ Restricted: Backend-specific tuning parameters
- ❌ Restricted: Replication factors, partition counts
Example:
```protobuf
message UserEvents {
  option (prism.backend) = "kafka";                // ✅ Allowed
  option (prism.access_pattern) = "append_heavy";  // ✅ Allowed
  option (prism.estimated_write_rps) = 10000;      // ✅ Allowed (within limits)
  option (prism.retention_days) = 90;              // ✅ Allowed (< 180-day max)
}
```
Level 2: Advanced (Requires Platform Team Approval)
- ✅ Allowed: All Level 1 permissions
- ✅ Allowed: Backend-specific tuning (e.g., Kafka partition count)
- ✅ Allowed: Custom replication factors
- ✅ Allowed: Extended retention (up to 1 year)
- ❌ Restricted: Cross-region replication
- ❌ Restricted: Encryption key management overrides
Example:
```protobuf
message HighThroughputLogs {
  option (prism.backend) = "kafka";
  option (prism.kafka_partitions) = 50;         // ✅ Advanced permission required
  option (prism.kafka_replication_factor) = 5;  // ✅ Advanced permission required
  option (prism.retention_days) = 365;          // ✅ Advanced permission required
}
```
Level 3: Expert (Platform Team Only)
- ✅ Allowed: All Level 1 & 2 permissions
- ✅ Allowed: Cross-region replication
- ✅ Allowed: Custom encryption keys (BYOK)
- ✅ Allowed: Low-level performance tuning
- ✅ Allowed: Override safety limits
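These tiers map naturally onto an ordered enum in the control plane, which the validator below references as `PermissionLevel`. A minimal sketch: the variants come from the levels above, while the derive-based ordering trick is an assumption.

```rust
/// Permission tiers from the levels above (sketch).
/// Deriving Ord lets policy checks use simple comparisons.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum PermissionLevel {
    Guided,   // Level 1: pre-approved backends, bounded estimates
    Advanced, // Level 2: backend tuning, custom replication, 1-year retention
    Expert,   // Level 3: cross-region replication, BYOK, safety overrides
}

fn can_tune_backend(level: PermissionLevel) -> bool {
    level >= PermissionLevel::Advanced
}
```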
Policy Enforcement Mechanism
Configuration Validation at Deploy Time:
```rust
use std::collections::HashMap;

// `MessageConfig` (sketched under Implementation Notes) and `ValidationError`
// are defined elsewhere in the control plane.
pub struct ConfigurationValidator {
    policies: HashMap<String, TeamPolicy>,
}

pub struct TeamPolicy {
    team_name: String,
    permission_level: PermissionLevel,
    limits: ConfigurationLimits,
}

pub struct ConfigurationLimits {
    max_write_rps: i64,
    max_read_rps: i64,
    max_retention_days: i32,
    max_data_size_gb: i64,
    allowed_backends: Vec<String>,
    backend_specific_tuning: bool,
}

impl ConfigurationValidator {
    pub fn validate(&self, config: &MessageConfig, team: &str) -> Result<(), ValidationError> {
        let policy = self
            .policies
            .get(team)
            .ok_or_else(|| ValidationError::UnknownTeam(team.to_string()))?;
        let limits = &policy.limits;

        // Check RPS within limits
        if config.estimated_write_rps > limits.max_write_rps {
            return Err(ValidationError::ExceedsLimit {
                field: "estimated_write_rps",
                value: config.estimated_write_rps,
                max: limits.max_write_rps,
                message: format!(
                    "Team {} limited to {}k writes/sec. Request platform team approval for higher capacity.",
                    team,
                    limits.max_write_rps / 1000
                ),
            });
        }

        // Check retention within limits
        if config.retention_days > limits.max_retention_days {
            return Err(ValidationError::ExceedsLimit {
                field: "retention_days",
                value: config.retention_days as i64,
                max: limits.max_retention_days as i64,
                message: format!(
                    "Team {} limited to {} day retention. Longer retention requires compliance review.",
                    team, limits.max_retention_days
                ),
            });
        }

        // Check backend in allowed list
        if let Some(backend) = &config.backend {
            if !limits.allowed_backends.contains(backend) {
                return Err(ValidationError::DisallowedBackend {
                    backend: backend.clone(),
                    allowed: limits.allowed_backends.clone(),
                    message: format!(
                        "Backend '{}' not approved for team {}. Allowed backends: {}",
                        backend,
                        team,
                        limits.allowed_backends.join(", ")
                    ),
                });
            }
        }

        // Check backend-specific tuning permissions
        if config.has_backend_tuning() && !limits.backend_specific_tuning {
            return Err(ValidationError::PermissionDenied {
                field: "backend tuning parameters",
                message: format!(
                    "Team {} does not have permission for backend-specific tuning. Request 'Advanced' permission level.",
                    team
                ),
            });
        }

        Ok(())
    }
}
```
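For illustration, a hypothetical deploy-time call site (`on_deploy` is an invented name, and the sketch assumes `ValidationError` derives `Debug`):

```rust
// Hypothetical deploy-time hook: reject the manifest before any
// infrastructure is touched ("fail loudly" from the principles above).
fn on_deploy(validator: &ConfigurationValidator, config: &MessageConfig, team: &str) {
    match validator.validate(config, team) {
        Ok(()) => println!("configuration accepted for team {team}"),
        Err(e) => {
            // Surface the policy error to the deploy pipeline and abort
            eprintln!("deploy rejected: {e:?}");
            std::process::exit(1);
        }
    }
}
```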
Example policy configuration (`policies/teams.yaml`):
```yaml
teams:
  # Most teams start here
  - name: user-platform-team
    permission_level: guided
    limits:
      max_write_rps: 50000
      max_read_rps: 100000
      max_retention_days: 180
      max_data_size_gb: 1000
      allowed_backends: [postgres, kafka, redis]
      backend_specific_tuning: false

  # Teams with demonstrated expertise
  - name: data-infrastructure-team
    permission_level: advanced
    limits:
      max_write_rps: 500000
      max_read_rps: 1000000
      max_retention_days: 365
      max_data_size_gb: 10000
      allowed_backends: [postgres, kafka, redis, nats, clickhouse]
      backend_specific_tuning: true

  # Platform team has unrestricted access
  - name: platform-team
    permission_level: expert
    limits:
      max_write_rps: unlimited
      max_read_rps: unlimited
      max_retention_days: unlimited
      max_data_size_gb: unlimited
      allowed_backends: [all]
      backend_specific_tuning: true
      cross_region_replication: true
      custom_encryption_keys: true
```
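Loading this file into the validator's policy map is straightforward with serde. A minimal sketch, assuming the `serde`, `serde_yaml`, and `anyhow` crates and numeric limits; the `unlimited` sentinel shown above would need a custom deserializer, modeled here as a missing key mapping to `None`:

```rust
use serde::Deserialize;
use std::{collections::HashMap, fs};

#[derive(Debug, Deserialize)]
struct PolicyFile {
    teams: Vec<TeamPolicyEntry>,
}

#[derive(Debug, Deserialize)]
struct TeamPolicyEntry {
    name: String,
    permission_level: String, // "guided" | "advanced" | "expert"
    limits: LimitsEntry,
}

#[derive(Debug, Deserialize)]
struct LimitsEntry {
    // None = unlimited (encoded by omitting the key)
    max_write_rps: Option<i64>,
    max_read_rps: Option<i64>,
    max_retention_days: Option<i32>,
    max_data_size_gb: Option<i64>,
    allowed_backends: Vec<String>,
    backend_specific_tuning: bool,
}

fn load_policies(path: &str) -> anyhow::Result<HashMap<String, TeamPolicyEntry>> {
    let raw = fs::read_to_string(path)?;
    let file: PolicyFile = serde_yaml::from_str(&raw)?;
    // Index by team name for O(1) lookup in the validator
    Ok(file.teams.into_iter().map(|t| (t.name.clone(), t)).collect())
}
```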
Permission Escalation Workflow
Scenario: Team needs higher capacity than allowed by policy.
Workflow:
1. Team deploys configuration with `estimated_write_rps: 100000`
2. Validation fails: `Team user-platform-team limited to 50k writes/sec`
3. Team opens a request: "Increase RPS limit to 100k for user-events namespace"
4. Platform team reviews:
   - Is the estimate reasonable? (check current metrics)
   - Will this impact cluster capacity? (check resource availability)
   - Is the backend choice optimal? (suggest alternatives if not)
5. If approved, update `policies/teams.yaml`:

   ```yaml
   - name: user-platform-team
     permission_level: guided
     limits:
       max_write_rps: 100000  # ← Increased
   ```

6. Team redeploys successfully
Key Benefits:
- Audit Trail: All permission changes version-controlled
- Gradual Escalation: Teams earn trust over time
- Central Oversight: Platform team maintains visibility
- Fast Approval: Simple cases auto-approved via policy updates
Common Configuration Mistakes Prevented
1. Excessive Retention Leading to Cost Overruns
```protobuf
// ❌ Rejected at deploy time
message DebugLogs {
  option (prism.retention_days) = 3650;  // 10 years!
  // Error: Team limited to 180 days. Compliance review required for >1 year retention.
}
```
2. Wrong Backend for Access Pattern
```protobuf
// ⚠️ Warning at deploy time
message HighThroughputEvents {
  option (prism.backend) = "postgres";
  option (prism.access_pattern) = "append_heavy";
  option (prism.estimated_write_rps) = 50000;
  // Warning: Postgres may struggle with 50k writes/sec. Consider Kafka for append-heavy workloads.
}
```
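Mistake 2 is a soft check rather than a hard rejection. A minimal sketch of such a lint; the function name and the threshold are illustrative, not part of the validator above:

```rust
/// Hypothetical deploy-time lint: warn when a backend choice fights the
/// declared access pattern instead of rejecting outright.
fn lint_backend_choice(config: &MessageConfig) -> Option<String> {
    const POSTGRES_WRITE_CEILING_RPS: i64 = 10_000; // illustrative threshold
    if config.backend.as_deref() == Some("postgres")
        && config.access_pattern == "append_heavy"
        && config.estimated_write_rps > POSTGRES_WRITE_CEILING_RPS
    {
        return Some(format!(
            "Postgres may struggle with {} writes/sec. Consider Kafka for append-heavy workloads.",
            config.estimated_write_rps
        ));
    }
    None
}
```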
3. Over-Provisioning Resources
```protobuf
// ❌ Rejected at deploy time
message UserSessions {
  option (prism.estimated_write_rps) = 100000;
  option (prism.kafka_partitions) = 500;  // Way too many!
  // Error: 500 partitions for 100k writes/sec is excessive. Recommended: 200 partitions (500 writes/partition/sec).
}
```
Organizational Benefits
Before Client-Originated Configuration:
- Infrastructure team: 10 people
- Application teams: 50 teams (500 engineers)
- Bottleneck: 2-4 week provisioning delays
- Cost: Infrastructure team growth required to scale
After Client-Originated Configuration with Authorization Boundaries:
- Infrastructure team: 10 people (maintain Prism platform)
- Application teams: 50 teams (self-service)
- Bottleneck: Eliminated for 90% of requests, escalation path for 10%
- Cost: Infrastructure team size stays constant
Scaling Math:
- Without Prism: 1 DBA per 10 teams → 50 teams need 5 DBAs
- With Prism: Platform team of 10 supports 500+ teams (50x improvement)
Future Enhancements
Automated Permission Elevation:
```yaml
auto_approve_conditions:
  - if: team.track_record > 6_months && team.incidents == 0
    then: grant permission_level: advanced
  - if: config.estimated_write_rps < current_metrics.write_rps * 1.5
    then: auto_approve  # Only a 50% increase, low risk
```
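How such rules might evaluate in the control plane; a minimal sketch with invented inputs (`track_record_months`, `incident_count`, and `current_write_rps` are assumptions, not existing Prism fields):

```rust
/// Hypothetical auto-approval logic mirroring the YAML rules above.
struct TeamRecord {
    track_record_months: u32,
    incident_count: u32,
}

fn auto_approve(team: &TeamRecord, requested_write_rps: i64, current_write_rps: i64) -> bool {
    // A clean six-month track record earns automatic elevation
    let trusted = team.track_record_months > 6 && team.incident_count == 0;
    // A bump under 1.5x observed load is considered low risk
    let small_increase = (requested_write_rps as f64) < (current_write_rps as f64) * 1.5;
    trusted || small_increase
}
```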
Cost Budgeting Integration:
```protobuf
message ExpensiveData {
  option (prism.estimated_cost_per_month) = 5000;  // $5k/month
  option (prism.team_budget_limit) = 10000;        // $10k/month
  // Auto-approved if within budget, requires approval if over
}
```
Implementation Notes
Protobuf Extensions
Define custom options in `prism/options.proto`:
```protobuf
syntax = "proto3";

package prism;

import "google/protobuf/descriptor.proto";

extend google.protobuf.MessageOptions {
  // Access pattern hint
  string access_pattern = 50001;  // "read_heavy" | "write_heavy" | "append_heavy" | "balanced"

  // Capacity estimates
  int64 estimated_read_rps = 50002;
  int64 estimated_write_rps = 50003;
  int64 data_size_estimate_mb = 50004;

  // Policies
  int32 retention_days = 50005;
  string consistency = 50006;  // "strong" | "eventual" | "causal"
  int32 latency_p99_ms = 50007;

  // Backend override (optional)
  string backend = 50008;  // "postgres" | "kafka" | "sqlite" | etc.
}

extend google.protobuf.FieldOptions {
  // Index type
  string index = 50101;  // "primary" | "secondary" | "partition_key" | "clustering_key"

  // PII tagging
  string pii = 50102;  // "email" | "name" | "ssn" | etc.

  // Encryption
  bool encrypt_at_rest = 50103;
}
```
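Once these options are parsed out of a message descriptor, the validator and planner operate on a plain struct. A minimal sketch of that intermediate type: the field names mirror the options above, while `avg_message_size_bytes`, `supports_sql`, and `has_backend_tuning` are assumptions inferred from the code that consumes them.

```rust
/// Extracted view of the prism.* options on one message
/// (sketch; populated by a descriptor-walking step not shown here).
pub struct MessageConfig {
    pub access_pattern: String,      // e.g. "append_heavy"
    pub estimated_read_rps: i64,
    pub estimated_write_rps: i64,
    pub data_size_estimate_mb: i64,
    pub retention_days: i32,
    pub consistency: String,         // "strong" | "eventual" | "causal"
    pub latency_p99_ms: i32,
    pub backend: Option<String>,     // explicit override, if any
    pub avg_message_size_bytes: i64, // derived estimate used for storage sizing
}

impl MessageConfig {
    /// Assumption: inferred from field annotations such as secondary indexes.
    pub fn supports_sql(&self) -> bool {
        true // placeholder for real inference
    }

    /// Assumption: true when any backend-specific option (e.g. kafka_partitions) is set.
    pub fn has_backend_tuning(&self) -> bool {
        false // placeholder
    }
}
```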
Capacity Planner Algorithm
```rust
// Consumes the `MessageConfig` sketched above.
struct CapacityPlanner;

impl CapacityPlanner {
    fn plan(&self, config: &MessageConfig) -> InfrastructureSpec {
        // 1. Select backend based on access pattern
        let backend = self.select_backend(config);

        // 2. Calculate required capacity
        let capacity = match backend {
            Backend::Kafka => self.plan_kafka(config),
            Backend::Postgres => self.plan_postgres(config),
            Backend::Nats => self.plan_nats(config),
            // ...
        };

        // 3. Return infrastructure specification
        InfrastructureSpec {
            backend,
            capacity,
            policies: self.extract_policies(config),
        }
    }

    fn select_backend(&self, config: &MessageConfig) -> Backend {
        // Explicit override in the proto wins; map the option string onto the enum
        if let Some(explicit) = config.backend.as_deref() {
            return match explicit {
                "kafka" => Backend::Kafka,
                "nats" => Backend::Nats,
                _ => Backend::Postgres,
            };
        }
        match config.access_pattern.as_str() {
            "append_heavy" => Backend::Kafka,
            "read_heavy" if config.supports_sql() => Backend::Postgres,
            "balanced" => Backend::Postgres,
            "graph" => Backend::Neptune,
            _ => Backend::Postgres, // Safe default
        }
    }

    fn plan_kafka(&self, config: &MessageConfig) -> KafkaCapacity {
        // Rule of thumb: 500 writes/sec per partition
        let partitions = (config.estimated_write_rps / 500).max(1);

        // Retention storage: daily volume in MB, scaled by the retention window
        let daily_data_mb =
            (config.estimated_write_rps * 86_400 * config.avg_message_size_bytes) / 1_000_000;
        let retention_storage_gb = daily_data_mb * config.retention_days as i64 / 1000;

        KafkaCapacity {
            partitions,
            replication_factor: 3, // Default for durability
            retention_storage_gb,
            instance_type: self.select_kafka_instance_type(config),
        }
    }
}
```
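Worked end to end on the `UserEvents` example from the top of this document; a hypothetical driver whose values match the annotations shown earlier (the average message size is an assumption, and `InfrastructureSpec` is presumed to derive `Debug`):

```rust
fn main() {
    let config = MessageConfig {
        access_pattern: "append_heavy".to_string(),
        estimated_read_rps: 500,
        estimated_write_rps: 10_000,
        data_size_estimate_mb: 1_000,
        retention_days: 90,
        consistency: "eventual".to_string(),
        latency_p99_ms: 10,
        backend: None,                 // let the planner choose
        avg_message_size_bytes: 1_000, // assumed; not in the proto annotations
    };

    let spec = CapacityPlanner.plan(&config);
    // Expect Kafka with 20 partitions (10k writes/sec ÷ 500 per partition)
    println!("{spec:?}");
}
```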
Evolution Strategy
Phase 1 (MVP): Support explicit backend selection, e.g. `option (prism.backend) = "postgres";`
Phase 2: Add access pattern hints and capacity estimates, e.g. `option (prism.access_pattern) = "read_heavy";` and `option (prism.estimated_read_rps) = 10000;`
Phase 3: Automatic backend selection based on patterns
Phase 4: Continuous optimization using runtime metrics
References
- Netflix Data Gateway Deployment Configuration
- AWS Well-Architected Framework - Capacity Planning
- Google SRE Book - Capacity Planning
- ADR-003: Protobuf as Single Source of Truth
Revision History
- 2025-10-05: Initial draft and acceptance