MEMO-071: Week 12 Day 4 - Operations Section Review for SREs

Date: 2025-11-15 Updated: 2025-11-15 Author: Platform Team Related: MEMO-052, MEMO-069, MEMO-070

Executive Summary

Goal: Evaluate operational content effectiveness for SRE audience (Site Reliability Engineers)

Scope: Operational guidance in RFC-057 through RFC-061

Findings:

  • Average SRE effectiveness: 64/100 (C grade)
  • Best: RFC-057 (75/100) - good monitoring, disaster recovery, configuration
  • Worst: RFC-058 (50/100) - minimal deployment and monitoring guidance
  • Key gap: No troubleshooting sections or operational runbooks

Critical Insight: RFCs are architecture documents, not operational runbooks. The 64/100 score is appropriate and expected for design-focused RFCs.

Recommendation: Accept current operational coverage as appropriate for architecture RFCs. Optional: Create separate operational runbooks if needed for production deployment.


Methodology

SRE Effectiveness Criteria

Deployment Guidance (20 points):

  • Deployment mentions: 3+ references
  • Step-by-step procedures
  • Configuration examples (YAML)

Monitoring & Alerting (25 points):

  • Monitoring mentions: 5+ references
  • Specific metrics: 5+ (latency, throughput, errors, etc.)
  • Alert definitions/thresholds

Troubleshooting (25 points):

  • Troubleshooting mentions: 3+ references
  • Symptom → Diagnosis → Fix patterns
  • Operational commands (kubectl, docker, etc.)

Operational Metrics (20 points):

  • SLO/SLA mentions: 3+ references
  • Quantitative SLOs (e.g., "99.9% availability")
  • Capacity planning guidance

Disaster Recovery (10 points):

  • DR mentions: 2+ references (backup, restore, failover)
  • RPO/RTO specifications

Scoring Algorithm

```python
def sre_score(deployment_mentions, no_procedures, yaml_configs,
              monitoring_mentions, metric_count, no_alerts,
              troubleshooting_mentions, no_symptom_fix, no_commands,
              slo_mentions, no_quantitative_slos, no_capacity,
              dr_mentions, no_rpo_rto):
    """Score one RFC against the SRE effectiveness rubric above."""
    score = 100

    # Deployment (20 points)
    if deployment_mentions < 3: score -= 10
    if no_procedures:           score -= 5
    if yaml_configs < 3:        score -= 5

    # Monitoring (25 points)
    if monitoring_mentions < 5: score -= 10
    if metric_count < 5:        score -= 10
    if no_alerts:               score -= 5

    # Troubleshooting (25 points)
    if troubleshooting_mentions < 3: score -= 15
    if no_symptom_fix:               score -= 5
    if no_commands:                  score -= 5

    # Operational Metrics (20 points)
    if slo_mentions < 3:        score -= 10
    if no_quantitative_slos:    score -= 5
    if no_capacity:             score -= 5

    # Disaster Recovery (10 points)
    if dr_mentions < 2:         score -= 5
    if no_rpo_rto:              score -= 5

    return score
```
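Plugging in RFC-058's extracted counts (from the Findings section below) reproduces its reported score:

```python
# RFC-058's counts, taken from the Findings section of this memo.
print(sre_score(
    deployment_mentions=1, no_procedures=False, yaml_configs=2,
    monitoring_mentions=3, metric_count=42, no_alerts=False,
    troubleshooting_mentions=0, no_symptom_fix=False, no_commands=True,
    slo_mentions=3, no_quantitative_slos=True, no_capacity=False,
    dr_mentions=3, no_rpo_rto=False,
))  # -> 50
```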

Analysis Tool

Created analyze_operational_sections.py (280 lines) to:

  • Count deployment, monitoring, troubleshooting references
  • Identify specific metrics and alerts
  • Detect SLO/SLA definitions
  • Find disaster recovery guidance

Findings

Overall Statistics

| Metric | Total | Per RFC | Assessment |
|---|---|---|---|
| Deployment mentions | 16 | 3.2 | ⚠️ Low |
| Monitoring mentions | 68 | 13.6 | ✅ Good |
| Troubleshooting | 3 | 0.6 | ❌ Very low |
| SLO mentions | 76 | 15.2 | ✅ Good |
| DR mentions | 109 | 21.8 | ✅ Excellent |
| YAML configs | 37 | 7.4 | ✅ Good |

Assessment: Strong on monitoring and DR, weak on troubleshooting and deployment procedures


RFC-057: Massive-Scale Graph Sharding (Score: 75/100, Grade: B) ✅

Deployment Guidance

| Metric | Value | Assessment |
|---|---|---|
| Deployment mentions | 6 | ✅ Good |
| Has procedures | Yes | ✅ Good |
| YAML configs | 12 | ✅ Best |

Sample Configuration (Hybrid Vertex ID Strategy):

```yaml
vertex_id_strategy:
  default: hierarchical            # Fast routing (10 ns)
  opaque:
    enabled: true
    use_cases:
      - hot_partitions             # Frequently rebalanced
      - cross_partition_vertices   # High fan-in
  routing_table:
    shards: 256                    # Distributed routing table
    cache_size: 10000000           # 10M vertex cache
    ttl_seconds: 3600
```

Assessment: ✅ Production-ready configuration with comments
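The operational implication of the hybrid strategy is a two-path routing function. A minimal sketch, assuming a flag bit marks opaque IDs and that hierarchical IDs carry the partition in their high bits (neither detail is specified in RFC-057):

```python
OPAQUE_FLAG = 1 << 63                # assumed marker bit for opaque IDs
ROUTING_TABLE: dict[int, int] = {}   # opaque vertex ID -> partition (distributed in practice)

def route(vertex_id: int) -> int:
    """Resolve a vertex ID to its owning partition."""
    if vertex_id & OPAQUE_FLAG:
        # Opaque path: routing-table lookup (~150 us per the RFC)
        return ROUTING_TABLE[vertex_id]
    # Hierarchical path: partition decoded from the ID itself (~10 ns)
    return vertex_id >> 40
```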

Monitoring & Alerting

| Metric | Value | Assessment |
|---|---|---|
| Monitoring mentions | 12 | ✅ Good |
| Specific metrics | 89 | ✅ Most metrics |
| Has alerts | Yes | ✅ Good |

Operational Metrics Identified:

  • Latency: "10 ns vertex ID parsing", "150 μs opaque routing"
  • Throughput: "100K queries/sec", "1M writes/sec"
  • Resource: "30 GB RAM per proxy", "100M vertices per partition"
  • Network: "$365M/year cross-AZ bandwidth" (cost monitoring)

Sample Alert Thresholds (implicit):

  • Partition rebalancing time: >30 min (hierarchical) indicates issue
  • Routing latency: >200 μs (opaque) indicates cache miss
  • Cross-AZ traffic: >5% of total queries (indicates poor placement)

Assessment: ✅ Comprehensive metrics, alerts implicit in performance claims
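If these implicit thresholds were promoted to real alerts, they might look like the following Prometheus rules (a sketch only; the metric names are hypothetical, since RFC-057 defines no metrics catalog):

```yaml
groups:
  - name: graph_sharding_alerts
    rules:
      - alert: SlowPartitionRebalance
        expr: partition_rebalance_duration_minutes > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Hierarchical partition rebalance exceeding 30 min"
      - alert: OpaqueRoutingLatencyHigh
        expr: routing_latency_microseconds{id_type="opaque"} > 200
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Opaque ID routing above 200 us; likely routing-cache misses"
      - alert: ExcessiveCrossAZTraffic
        expr: rate(cross_az_queries_total[5m]) / rate(queries_total[5m]) > 0.05
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Cross-AZ traffic above 5% of queries; review placement hints"
```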

Troubleshooting

| Metric | Value | Assessment |
|---|---|---|
| Troubleshooting mentions | 0 | ❌ None |
| Has symptom/fix | No | ❌ Missing |
| Has commands | No | ❌ Missing |

Gap: No dedicated troubleshooting section

What's Missing:

  • Symptom: "Slow queries after partition rebalance" → Fix: "Wait for cache warmup (30 min)"
  • Symptom: "High cross-AZ bandwidth costs" → Fix: "Review placement hints configuration"
  • Symptom: "Vertex not found errors" → Fix: "Check bloom filter false positive rate"

Recommendation: ⚠️ Optional - add "Operational Troubleshooting" section with common issues

Operational Metrics

| Metric | Value | Assessment |
|---|---|---|
| SLO mentions | 23 | ✅ Second most |
| Quantitative SLOs | 1 | ⚠️ Low |
| Has capacity | Yes | ✅ Good |

Quantitative SLO Found:

  • Availability: Implicit "99.9% availability" (3 replicas per partition)

Capacity Planning:

  • "64 partitions per proxy" (updated from 16 based on MEMO-050)
  • "100M vertices per partition = 10 GB RAM"
  • "1000 proxies × 100M vertices = 100B vertices total"

Assessment: ✅ Excellent capacity planning guidance

Disaster Recovery

| Metric | Value | Assessment |
|---|---|---|
| DR mentions | 33 | ✅ Second most |
| Has RPO/RTO | Yes | ✅ Good |

DR Guidance:

  • Partition replication: "3 replicas per partition" (cross-AZ)
  • Rebalancing: "Dynamic partition migration without downtime"
  • Failover: "Automatic replica promotion on node failure"
  • RPO: Implicit "seconds" (replication lag)
  • RTO: "10 seconds" (partition migration with opaque IDs)

Assessment: ✅ Strong DR/HA guidance

Overall Assessment

Strengths:

  • ✅ 12 YAML configuration examples (most of any RFC)
  • ✅ 89 operational metrics (comprehensive)
  • ✅ 33 DR mentions (strong HA/DR guidance)
  • ✅ 23 SLO references
  • ✅ Excellent capacity planning

Weaknesses:

  • ❌ No troubleshooting section
  • ⚠️ Only 1 quantitative SLO

Recommendation: ✅ Good as-is for architecture RFC (optional: add troubleshooting section)


RFC-058: Multi-Level Graph Indexing (Score: 50/100, Grade: D) ⚠️

Deployment Guidance

| Metric | Value | Assessment |
|---|---|---|
| Deployment mentions | 1 | ❌ Very low |
| Has procedures | Yes | ✅ Good |
| YAML configs | 2 | ❌ Low |

Gap: Minimal deployment guidance

What's Missing:

  • Index construction configuration
  • Online vs offline index building toggle
  • Bloom filter size tuning parameters

Recommendation: ⚠️ Add deployment configuration examples

Monitoring & Alerting

| Metric | Value | Assessment |
|---|---|---|
| Monitoring mentions | 3 | ❌ Very low |
| Specific metrics | 42 | ✅ Good |
| Has alerts | Yes | ✅ Good |

Issue: 42 metrics mentioned but only 3 "monitoring" references

Metrics Identified:

  • Query latency: "27 hours → 5 seconds" (20,000× speedup)
  • Index size: "100 GB partition index → 10 GB with bloom filters"
  • Construction time: "Index build time: 2 hours for 100M vertices"

Gap: No dedicated monitoring section

Recommendation: ⚠️ Add "Monitoring & Observability" section (sketched after this list) with:

  • Key metrics to track (query latency, index hit rate, bloom filter FP rate)
  • Alert thresholds
  • Dashboard recommendations
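As a starting point, such a section could be as small as the following sketch (metric names and thresholds are illustrative assumptions, not taken from RFC-058):

```yaml
monitoring:
  metrics:
    - name: index_query_latency_p99    # end-to-end indexed query latency
      target: "< 100ms"
    - name: index_hit_rate             # fraction of queries served by the index
      target: "> 95%"
    - name: bloom_filter_fp_rate       # false positives cause wasted partition reads
      target: "< 1%"
  alerts:
    - metric: index_hit_rate
      condition: "< 90% for 10m"
      severity: warning
    - metric: index_build_duration_hours
      condition: "> 3"                 # RFC cites ~2 h for 100M vertices
      severity: warning
```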

Troubleshooting

| Metric | Value | Assessment |
|---|---|---|
| Troubleshooting mentions | 0 | ❌ None |
| Has symptom/fix | Yes | ✅ Good (implicit) |
| Has commands | No | ❌ Missing |

Implicit Troubleshooting (from trade-off discussions):

  • Problem: "Query slow after data load" → Implied fix: "Wait for index build (2 hours)"
  • Problem: "Index memory overhead" → Implied fix: "Use bloom filters (90% reduction)"

Gap: No explicit troubleshooting section

Operational Metrics

| Metric | Value | Assessment |
|---|---|---|
| SLO mentions | 3 | ✅ Minimum |
| Quantitative SLOs | 0 | ❌ None |
| Has capacity | Yes | ✅ Good |

Gap: No quantitative SLOs (e.g., "99% queries <100ms")

Recommendation: ⚠️ Add quantitative SLO targets

Disaster Recovery

| Metric | Value | Assessment |
|---|---|---|
| DR mentions | 3 | ✅ Minimum |
| Has RPO/RTO | Yes | ✅ Good |

DR Guidance (minimal):

  • "Incremental index updates via WAL" (implies RPO = WAL lag)
  • "Online index building without blocking queries" (implies RTO = 0 for queries)

Assessment: ⚠️ Minimal but adequate
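A minimal sketch of what "incremental index updates via WAL" implies operationally: recovery replays the log into the index rather than rebuilding it, so RPO is bounded by WAL lag. (The interfaces below are assumptions, not RFC-058's API.)

```python
def replay_wal(index: dict[str, set[int]], wal_entries) -> None:
    """Apply WAL records (op, vertex_id, keys) so the index trails the log,
    avoiding the ~2-hour full rebuild cited in the RFC."""
    for op, vertex_id, keys in wal_entries:
        if op == "upsert":
            for key in keys:
                index.setdefault(key, set()).add(vertex_id)
        elif op == "delete":
            for key in keys:
                index.get(key, set()).discard(vertex_id)
```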

Overall Assessment

Strengths:

  • ✅ 42 operational metrics identified
  • ✅ Good capacity planning

Weaknesses:

  • ❌ Only 1 deployment mention
  • ❌ Only 2 YAML configs
  • ❌ Only 3 monitoring mentions
  • ❌ No quantitative SLOs
  • ❌ No troubleshooting section

Recommendation: ⚠️ Needs improvement - add:

  1. Deployment configuration section (2-3 YAML examples)
  2. Monitoring & Observability section
  3. Quantitative SLO targets
  4. Optional: Troubleshooting section

Estimated effort: 1-2 hours


RFC-059: Hot/Cold Storage Tiers (Score: 65/100, Grade: C) ⚠️

Deployment Guidance

| Metric | Value | Assessment |
|---|---|---|
| Deployment mentions | 1 | ❌ Low |
| Has procedures | Yes | ✅ Good |
| YAML configs | 10 | ✅ Second best |

Sample Configuration (Hot/Cold Tier Configuration):

```yaml
storage:
  hot_tier:
    percentage: 10             # 10% in-memory
    memory_gb: 21000           # 21 TB RAM
    eviction_policy: lru
  cold_tier:
    backend: s3
    bucket: prism-graph-cold
    region: us-west-2
    snapshot_format: parquet   # or protobuf, json
```

Assessment: ✅ Good configuration examples
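To make the hot/cold split concrete, here is a minimal read-path sketch matching the configuration above: serve from memory when possible, fall back to S3, and admit the fetched value with LRU eviction. (Class and method names are illustrative, not from RFC-059.)

```python
import time
from typing import Any, Callable

class TieredStore:
    def __init__(self, cold_fetch: Callable[[str], Any], hot_capacity: int):
        self.hot: dict[str, Any] = {}        # in-memory hot tier (~10 us reads)
        self.last_access: dict[str, float] = {}
        self.cold_fetch = cold_fetch         # e.g. an S3 GET (~50-200 ms reads)
        self.hot_capacity = hot_capacity

    def get(self, key: str) -> Any:
        if key in self.hot:                  # hot-tier hit (target: ~90% of reads)
            self.last_access[key] = time.monotonic()
            return self.hot[key]
        value = self.cold_fetch(key)         # cold-tier miss path
        self._admit(key, value)
        return value

    def _admit(self, key: str, value: Any) -> None:
        if len(self.hot) >= self.hot_capacity:
            lru = min(self.last_access, key=self.last_access.__getitem__)
            del self.hot[lru], self.last_access[lru]
        self.hot[key] = value
        self.last_access[key] = time.monotonic()
```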

Monitoring & Alerting

| Metric | Value | Assessment |
|---|---|---|
| Monitoring mentions | 30 | ✅ Most monitoring |
| Specific metrics | 40 | ✅ Good |
| Has alerts | Yes | ✅ Good |

Operational Metrics:

  • Cache hit rate: "90% hot tier hit rate"
  • Latency: "10 μs hot tier, 50-200ms cold tier"
  • Cost: "$583k/month hot, $4.3k/month cold"
  • Load time: "60 seconds for 10 TB snapshot"

Sample Alerts (implicit):

  • Cache hit rate <85%: Review hot/cold classification
  • Cold tier latency >500ms: Check S3 throttling
  • Hot tier memory >95%: Evict cold data

Assessment: ✅ Best monitoring guidance of all RFCs

Troubleshooting

| Metric | Value | Assessment |
|---|---|---|
| Troubleshooting mentions | 1 | ⚠️ Low |
| Has symptom/fix | Yes | ✅ Good |
| Has commands | No | ❌ Missing |

Implicit Troubleshooting:

  • Symptom: "High query latency" → Check if hitting cold tier frequently
  • Symptom: "High S3 costs" → Review hot tier percentage (increase from 10% to 15%)

Gap: No operational commands (e.g., "aws s3 ls", "kubectl get pods")

Operational Metrics

| Metric | Value | Assessment |
|---|---|---|
| SLO mentions | 6 | ✅ Good |
| Quantitative SLOs | 0 | ❌ None |
| Has capacity | Yes | ✅ Excellent |

Cost-Based SLOs (implicit):

  • "95% cost reduction while maintaining query performance"
  • "90% queries hit hot tier (sub-second latency)"

Capacity Planning:

  • "10% hot tier = 21 TB RAM = 1000 proxies"
  • "Adjust hot tier percentage based on working set size"

Assessment: ✅ Excellent cost/performance trade-off analysis

Disaster Recovery

| Metric | Value | Assessment |
|---|---|---|
| DR mentions | 73 | ✅ Most DR mentions |
| Has RPO/RTO | Yes | ✅ Excellent |

DR Guidance (comprehensive):

  • Snapshot: "S3 snapshots every 6 hours" (RPO = 6 hours)
  • Restore: "60 seconds to load 10 TB from S3" (RTO = 60 seconds)
  • Replication: "S3 cross-region replication (99.99% durability)"
  • Failover: "Parallel loading from 1000 workers"

Assessment: ✅ Best DR guidance of all RFCs
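The RTO claim is easy to sanity-check: loading 10 TB in 60 seconds across the 1000 parallel workers cited in RFC-059 requires roughly 167 MB/s per worker, which is within reach of a single S3 client:

```python
total_bytes = 10e12      # 10 TB snapshot
workers = 1000           # parallel loaders cited in RFC-059
rto_seconds = 60

per_worker = total_bytes / workers / rto_seconds
print(f"{per_worker / 1e6:.0f} MB/s per worker")   # ~167 MB/s
```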

Overall Assessment

Strengths:

  • ✅ 30 monitoring mentions (best of all RFCs)
  • ✅ 73 DR mentions (best of all RFCs)
  • ✅ 10 YAML configs
  • ✅ Excellent cost/performance analysis
  • ✅ Strong capacity planning

Weaknesses:

  • ❌ Only 1 deployment mention
  • ❌ No quantitative SLOs
  • ❌ No operational commands

Recommendation: ✅ Good as-is for architecture RFC (optional: add quantitative SLOs)


RFC-060: Distributed Gremlin Execution (Score: 70/100, Grade: B) ✅

Deployment Guidance

| Metric | Value | Assessment |
|---|---|---|
| Deployment mentions | 6 | ✅ Good |
| Has procedures | Yes | ✅ Good |
| YAML configs | 8 | ✅ Good |

Assessment: ✅ Good deployment configuration coverage

Monitoring & Alerting

| Metric | Value | Assessment |
|---|---|---|
| Monitoring mentions | 15 | ✅ Good |
| Specific metrics | 86 | ✅ Second most |
| Has alerts | Yes | ✅ Good |

Query Performance Metrics:

  • Query latency: "Sub-second for common traversals"
  • Partition pruning: "10-100× speedup"
  • Parallelism: "Adaptive based on cardinality"

Assessment: ✅ Comprehensive query metrics
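The 10-100× pruning speedup comes from skipping partitions that cannot hold the queried vertices. A self-contained sketch of the idea (a range-based toy; RFC-060's actual pruning criteria are not detailed in this memo):

```python
from dataclasses import dataclass

@dataclass
class Partition:
    name: str
    lo: int   # inclusive lower bound of the vertex-ID range
    hi: int   # exclusive upper bound

def prune(partitions: list[Partition], query_ids: set[int]) -> list[Partition]:
    """Keep only partitions whose range can contain a queried vertex."""
    return [p for p in partitions if any(p.lo <= v < p.hi for v in query_ids)]

parts = [Partition("p0", 0, 1000), Partition("p1", 1000, 2000), Partition("p2", 2000, 3000)]
print([p.name for p in prune(parts, {42, 1500})])   # ['p0', 'p1'] -- p2 never touched
```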

Troubleshooting

| Metric | Value | Assessment |
|---|---|---|
| Troubleshooting mentions | 2 | ⚠️ Low |
| Has symptom/fix | Yes | ✅ Good |
| Has commands | No | ❌ Missing |

Implicit Troubleshooting:

  • Slow query: Check if partition pruning is enabled
  • High memory: Review intermediate result size (use streaming)

Assessment: ⚠️ Minimal troubleshooting guidance

Operational Metrics

| Metric | Value | Assessment |
|---|---|---|
| SLO mentions | 34 | ✅ Most SLOs |
| Quantitative SLOs | 0 | ❌ None |
| Has capacity | Yes | ✅ Good |

Gap: Many SLO references but no specific targets (e.g., "99.9% queries <1s")

Assessment: ⚠️ Good SLO discussion, missing quantitative targets

Disaster Recovery

| Metric | Value | Assessment |
|---|---|---|
| DR mentions | 0 | ❌ None |
| Has RPO/RTO | Yes (implicit) | ⚠️ Indirect |

Issue: No DR mentions because query execution is stateless

Implicit DR: Query coordinator failures → retry on different node (RTO <1 second)

Assessment: ⚠️ DR is largely not applicable (query execution is stateless), but the RFC should state this explicitly

Overall Assessment

Strengths:

  • ✅ 86 operational metrics (query performance)
  • ✅ 34 SLO mentions
  • ✅ Good deployment guidance

Weaknesses:

  • ❌ No quantitative SLOs
  • ❌ No DR section (stateless, but should clarify)
  • ❌ Minimal troubleshooting

Recommendation: ✅ Good as-is (optional: add quantitative SLO targets)


RFC-061: Graph Authorization (Score: 60/100, Grade: C) ⚠️

Deployment Guidance

| Metric | Value | Assessment |
|---|---|---|
| Deployment mentions | 2 | ❌ Low |
| Has procedures | Yes | ✅ Good |
| YAML configs | 5 | ✅ Good |

Sample Configuration (Authorization Policy):

```yaml
authorization:
  mode: label_based
  policies:
    - principal: engineering_team
      clearance: [public, internal, confidential]
    - principal: finance_team
      clearance: [public, internal, financial]
```

Assessment: ✅ Good policy configuration examples
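The policy model above implies a simple per-vertex check: a principal may see a vertex only if its clearance covers every label on the vertex. A minimal sketch (function and data layout are assumptions, not RFC-061's API):

```python
POLICIES: dict[str, set[str]] = {
    "engineering_team": {"public", "internal", "confidential"},
    "finance_team": {"public", "internal", "financial"},
}

def is_authorized(principal: str, vertex_labels: set[str]) -> bool:
    """Allow access only if every vertex label is within the principal's clearance."""
    return vertex_labels <= POLICIES.get(principal, set())

assert is_authorized("finance_team", {"internal", "financial"})
assert not is_authorized("finance_team", {"confidential"})   # outside clearance
```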

Monitoring & Alerting

| Metric | Value | Assessment |
|---|---|---|
| Monitoring mentions | 8 | ⚠️ Low |
| Specific metrics | 17 | ⚠️ Low |
| Has alerts | Yes | ✅ Good |

Authorization Metrics:

  • Authorization overhead: "<100 μs per vertex"
  • Denied access rate: "Track for security auditing"
  • Audit log volume: "99% reduction via compression"

Gap: Minimal monitoring discussion

Recommendation: ⚠️ Add monitoring section covering:

  • Authorization failure rate (alert on spikes)
  • Audit log ingestion rate
  • Policy evaluation latency

Troubleshooting

| Metric | Value | Assessment |
|---|---|---|
| Troubleshooting mentions | 0 | ❌ None |
| Has symptom/fix | Yes | ✅ Good (implicit) |
| Has commands | No | ❌ Missing |

Implicit Troubleshooting:

  • Symptom: "User can't access vertices" → Check clearance level vs vertex labels
  • Symptom: "High authorization latency" → Review policy complexity

Gap: No dedicated troubleshooting section

Operational Metrics

| Metric | Value | Assessment |
|---|---|---|
| SLO mentions | 10 | ✅ Good |
| Quantitative SLOs | 0 | ❌ None |
| Has capacity | Yes | ✅ Good |

Gap: No quantitative SLOs for authorization performance

Recommendation: ⚠️ Add SLO targets:

  • "99.99% authorization checks <100 μs"
  • "100% denied access logged"

Disaster Recovery

| Metric | Value | Assessment |
|---|---|---|
| DR mentions | 0 | ❌ None |
| Has RPO/RTO | Yes (audit logs) | ⚠️ Indirect |

Implicit DR:

  • Audit logs: "Replicated to compliance storage"
  • Policy changes: "Versioned and backed up"

Assessment: ⚠️ Minimal DR guidance (policies should be backed up)

Overall Assessment

Strengths:

  • ✅ 5 YAML policy examples
  • ✅ Clear authorization model

Weaknesses:

  • ❌ Only 8 monitoring mentions (lowest)
  • ❌ Only 17 metrics (lowest)
  • ❌ No quantitative SLOs
  • ❌ No DR guidance
  • ❌ No troubleshooting section

Recommendation: ⚠️ Needs improvement - add:

  1. Monitoring & Alerting section (authorization metrics, audit logs)
  2. Quantitative SLO targets
  3. Policy backup/restore procedures

Estimated effort: 1 hour


Key Insights

1. RFCs Are Architecture Documents, Not Runbooks

Average SRE Score: 64/100

This is appropriate and expected because:

  • RFCs focus on design decisions (what and why)
  • Runbooks focus on operations (how to deploy/monitor/troubleshoot)
  • Target audience: Architects and senior engineers (not ops teams)

Comparison:

  • Engineer effectiveness (Week 12 Days 2-3): 86/100 ✅
  • SRE effectiveness (Week 12 Day 4): 64/100 ⚠️
  • 22-point gap is by design (architecture vs operations focus)

Recommendation: ✅ Accept 64/100 as appropriate for architecture RFCs


2. Strong Monitoring, Weak Troubleshooting

| Category | Total Mentions | Per RFC | Grade |
|---|---|---|---|
| Monitoring | 68 | 13.6 | ✅ Good |
| Metrics | 274 | 54.8 | ✅ Excellent |
| Troubleshooting | 3 | 0.6 | ❌ Very low |

Insight: RFCs provide extensive observable metrics but minimal troubleshooting guidance

Why this makes sense:

  • Metrics can be defined at architecture time
  • Troubleshooting requires production experience
  • Runbooks emerge after deployment

Recommendation: ✅ Current metric coverage is excellent for architecture RFCs


3. Disaster Recovery Well-Covered

Total DR mentions: 109 (excellent)

Best: RFC-059 (73 mentions) - comprehensive snapshot/restore strategy

Key DR Patterns:

  • Replication: "3 replicas per partition" (RFC-057)
  • Snapshots: "S3 snapshots every 6 hours" (RFC-059)
  • Failover: "Automatic replica promotion" (RFC-057)
  • RPO/RTO: Defined in every RFC, though often implicitly (e.g., RPO inferred from replication lag)

Assessment: ✅ Excellent DR/HA guidance at architecture level


4. Configuration Examples Present But Could Be Enhanced

YAML Configs: 37 total (avg 7.4 per RFC)

Best: RFC-057 (12 configs)

What's there:

  • High-level architecture configuration
  • Feature toggle examples
  • Storage tier settings

What could be added (if targeting ops teams):

  • Complete deployment manifests (Kubernetes YAML)
  • Prometheus alert rules
  • Grafana dashboard JSON
  • Runbook commands

Recommendation: ✅ Current configuration examples appropriate for architecture RFCs


Recommendations by RFC

RFC-057: Good ✅

Score: 75/100

Recommendation: ✅ Accept as-is (best operational coverage)

Optional Enhancement: Add troubleshooting section with common issues


RFC-058: Needs Improvement ⚠️

Score: 50/100 (lowest)

Issues:

  1. ❌ Only 1 deployment mention
  2. ❌ Only 3 monitoring mentions
  3. ❌ No quantitative SLOs

Recommendation: ⚠️ Add:

  1. Deployment configuration section (2-3 YAML examples)
  2. Monitoring & Observability section (key metrics, alerts)
  3. Quantitative SLO targets

Estimated Effort: 1-2 hours


RFC-059: Good ✅

Score: 65/100

Strengths:

  • ✅ 30 monitoring mentions (best)
  • ✅ 73 DR mentions (best)

Minor Gap: No quantitative SLOs

Recommendation: ✅ Accept as-is (optional: add quantitative SLOs)


RFC-060: Good ✅

Score: 70/100

Recommendation: ✅ Accept as-is

Optional Enhancement: Add quantitative SLO targets ("99.9% queries <1s")


RFC-061: Needs Improvement ⚠️

Score: 60/100

Issues:

  1. ❌ Only 8 monitoring mentions (lowest)
  2. ❌ No quantitative SLOs
  3. ❌ No DR guidance

Recommendation: ⚠️ Add:

  1. Monitoring & Alerting section (authorization metrics)
  2. Quantitative SLO targets
  3. Policy backup/restore procedures

Estimated Effort: 1 hour


Summary

Overall Assessment

| RFC | Score | Grade | Status |
|---|---|---|---|
| RFC-057 | 75/100 | B | ✅ Good |
| RFC-060 | 70/100 | B | ✅ Good |
| RFC-059 | 65/100 | C | ✅ Acceptable |
| RFC-061 | 60/100 | C | ⚠️ Needs work |
| RFC-058 | 50/100 | D | ⚠️ Needs work |
| Average | 64/100 | C | Appropriate |

Final Recommendation

Accept current operational coverage as appropriate for architecture RFCs

Rationale:

  1. Target audience: Architects and senior engineers (not ops teams)
  2. Document type: Design RFCs (not operational runbooks)
  3. Content focus: What and why (not how to operate)
  4. Comparison: 86/100 engineer effectiveness vs 64/100 SRE effectiveness = appropriate gap

Optional Enhancements (if targeting ops teams):

  • RFC-058: Add monitoring section and quantitative SLOs (1-2 hours)
  • RFC-061: Add monitoring section and DR guidance (1 hour)
  • All RFCs: Add troubleshooting sections with common issues (1 hour each)

Alternative Approach: Create separate operational runbooks after RFCs are implemented


Next Steps (Week 12 Day 5)

Day 5: Final Readability Pass

Focus: End-to-end narrative flow and polish

Tasks:

  • Read each RFC start-to-finish (full document review)
  • Check for orphaned concepts (references without definitions)
  • Verify forward references resolve correctly
  • Ensure logical progression (motivation → design → implementation → evaluation)
  • Final polish and consistency check

Goal: Ensure each RFC reads as cohesive narrative from start to finish


Appendices

Appendix A: SRE Effectiveness Score Distribution

```
100 |
    |
 75 | █ RFC-057
 70 | █ RFC-060
 65 | █ RFC-059
 60 | █ RFC-061
 50 | █ RFC-058
  0 |________________
      SRE Effectiveness Score
```

Appendix B: Operational Coverage Heatmap

| RFC | Deploy | Monitor | Trouble | SLO | DR | Total |
|---|---|---|---|---|---|---|
| 057 | ✅ | ✅ | ❌ | ✅ | ✅ | 75/100 |
| 060 | ✅ | ✅ | ⚠️ | ✅ | ❌ | 70/100 |
| 059 | ⚠️ | ✅ | ⚠️ | ⚠️ | ✅ | 65/100 |
| 061 | ⚠️ | ⚠️ | ❌ | ⚠️ | ❌ | 60/100 |
| 058 | ❌ | ⚠️ | ❌ | ⚠️ | ⚠️ | 50/100 |

Appendix C: What Operational Runbooks Would Include

If creating separate operational runbooks (beyond RFC scope):

  1. Deployment Runbook:

    • Step-by-step Kubernetes deployment
    • Helm chart configurations
    • Environment-specific settings
    • Pre-deployment checklist
  2. Monitoring Runbook:

    • Prometheus metrics catalog
    • Grafana dashboard JSON
    • Alert rule definitions
    • On-call escalation procedures
  3. Troubleshooting Runbook:

    • Common failure modes
    • Symptom → Diagnosis → Fix patterns
    • kubectl/docker commands
    • Log analysis procedures
  4. Disaster Recovery Runbook:

    • Backup procedures
    • Restore procedures
    • Failover procedures
    • DR testing schedule

Recommendation: Create these after RFC implementation, based on production experience