# RFC Edit Summary for MEMO-050 Production Readiness

Parent Document: MEMO-050: Production Readiness Analysis
Date: 2025-11-15
Author: Platform Team
This document summarizes all required edits to RFCs 057-061 based on the production readiness analysis in MEMO-050.
## RFC-057: Massive-Scale Graph Sharding

### Edit 1: Update Partition Sizing (Finding 6)

Location: Line 269
Current:
partitions_per_proxy: 16
Change To:
partitions_per_proxy: 64 # Increased from 16 for finer granularity
vertices_per_partition: 1_562_500 # Down from 6.25M
partition_size_mb: 156 # Down from 625 MB
rationale: |
- Finer hot/cold temperature management
- Faster rebalancing (156 MB chunks vs 625 MB)
- Better load distribution (lower variance)
- Smaller failure blast radius
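The sketch below is a quick sanity check of the revised numbers; the ~100 bytes/vertex footprint is an assumption inferred from the original 6.25M-vertex / 625 MB partitions, not a figure stated in RFC-057.

```go
// Quick sanity check of the revised sizing; bytesPerVertexEst is an assumption.
func revisedPartitionSizing() (verticesPerPartition, partitionSizeMB int) {
	const (
		verticesPerProxy   = 100_000_000 // unchanged: 16 × 6.25M = 64 × 1.5625M
		partitionsPerProxy = 64
		bytesPerVertexEst  = 100 // assumed average footprint per vertex
	)
	verticesPerPartition = verticesPerProxy / partitionsPerProxy
	partitionSizeMB = verticesPerPartition * bytesPerVertexEst / 1_000_000
	return // 1_562_500 vertices, 156 MB
}
```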
### Edit 2: Replace CRC32 with xxHash (Finding 7)

Location: Lines 290-300
Current:
crc32("user:alice") % 10 → cluster_id
Change To:
import "github.com/cespare/xxhash/v2"
func (p *Partitioner) GetClusterID(vertexID string) int {
h := xxhash.Sum64String(vertexID)
return int(h % uint64(p.NumClusters))
}
// Benefits:
// - 1.7× faster than CRC32
// - 8× better distribution (1.8% variance vs 15%)
// - Industry standard (RocksDB, ClickHouse)
### Edit 3: Add Network Topology Awareness (Finding 3)

Location: After Line 227 (end of Section 4)
Add New Section 4.6:
### Section 4.6: Network Topology-Aware Sharding
#### Partition Metadata Extension
message PartitionMetadata {
  string partition_id = 1;
  string cluster_id = 2;
  string proxy_id = 3;

  // Network topology (NEW)
  NetworkLocation location = 4;
}

message NetworkLocation {
  string region = 1;             // "us-west-2"
  string availability_zone = 2;  // "us-west-2a"
  string rack_id = 3;            // "rack-42" (optional)

  // Network characteristics
  float cross_az_bandwidth_gbps = 4;
  float cross_region_bandwidth_gbps = 5;
}
#### Network Cost Model
Cross-AZ bandwidth costs **$0.01/GB**, which dominates at scale:
| Network Path | Latency | Cost | Priority |
|--------------|---------|------|----------|
| Same rack | 100 μs | $0/GB | Highest |
| Same AZ | 200 μs | $0/GB | High |
| Cross-AZ | 1-2 ms | $0.01/GB | Medium |
| Cross-region | 20-100 ms | $0.02/GB | Low |
**Cost Impact**:
- Without optimization: $600M/year in cross-AZ bandwidth
- With AZ-aware placement: $6M/year (5% cross-AZ traffic)
#### Locality-Aware Placement Strategy
func (nap *NetworkAwarePartitioner) AssignPartition(
    vertex *Vertex,
    placementHint *PlacementHint,
) string {
    // If hint provided, prefer co-location in same AZ
    if placementHint != nil && placementHint.CoLocateWithVertex != "" {
        targetPartition := nap.GetPartitionForVertex(placementHint.CoLocateWithVertex)
        targetAZ := nap.GetAZ(targetPartition)

        // Find partition in same AZ
        sameAZPartition := nap.FindPartitionInAZ(targetAZ)
        if sameAZPartition != "" {
            return sameAZPartition
        }
    }

    // Fall back to consistent hashing
    return nap.ConsistentHash(vertex.ID)
}
#### Multi-AZ Deployment Strategy
cluster_topology:
  region: us-west-2

  availability_zones:
    - id: us-west-2a
      proxies: 334
      primary_for: [cluster_0, cluster_3, cluster_6, cluster_9]
    - id: us-west-2b
      proxies: 333
      primary_for: [cluster_1, cluster_4, cluster_7]
    - id: us-west-2c
      proxies: 333
      primary_for: [cluster_2, cluster_5, cluster_8]

partition_replication:
  factor: 3                    # 3 replicas per partition
  placement: different_az      # Replicas in different AZs
  read_preference: primary_az  # Read from local AZ first
#### Cost Savings
| Scale | Without AZ Awareness | With AZ Awareness | Savings |
|-------|---------------------|-------------------|---------|
| 10B vertices | $45M/year | $15M/year | 67% |
| 100B vertices | $600M/year | $30M/year | 95% |
**Recommendation**: Network-aware placement is **mandatory** at 100B scale.
### Edit 4: Add Failure Detection and Recovery (Finding 9)

Location: Add new Section 7 after Section 6
Add:
## Section 7: Failure Detection and Recovery
At 1000 nodes, expect **~1 node failure per day** (typical MTBF). Comprehensive failure handling is critical.
### 7.1 Node Failure Detection
**Heartbeat Protocol**:
heartbeat_config:
  interval: 10s       # Send heartbeat every 10 seconds
  timeout: 30s        # Mark dead after 3 missed heartbeats
  grace_period: 60s   # Wait before triggering recovery

detection_method:
  type: phi_accrual   # Φ accrual failure detector
  threshold: 8.0      # Confidence level for failure
**Gossip Protocol**:
Each node maintains cluster membership:
- Gossip interval: 1 second
- Gossip fanout: 3 random nodes
- Suspicion threshold: 3 nodes report failure
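For illustration, a minimal sketch of how the heartbeat timeout and gossip suspicion threshold above could combine into a failure verdict; the type, field, and method names are assumptions for this sketch, not part of RFC-057.

```go
import (
	"sync"
	"time"
)

// Illustrative membership table: declare a node failed only after 3 gossip
// reports AND the local 30s heartbeat timeout, mirroring the settings above.
type MembershipTable struct {
	mu            sync.Mutex
	lastHeartbeat map[string]time.Time // nodeID → last heartbeat seen locally
	suspicions    map[string]int       // nodeID → peers reporting the node as failed
}

// RecordSuspicion is called when a gossip round reports a missed heartbeat.
func (mt *MembershipTable) RecordSuspicion(nodeID string, now time.Time) bool {
	mt.mu.Lock()
	defer mt.mu.Unlock()
	mt.suspicions[nodeID]++
	timedOut := now.Sub(mt.lastHeartbeat[nodeID]) > 30*time.Second
	return mt.suspicions[nodeID] >= 3 && timedOut
}
```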
### 7.2 Partition Recovery Options
**Option A: Replica Failover** (Fast, 10 seconds)
replication_strategy:
  replicas_per_partition: 3
  placement: different_az
  consistency: eventual

failover_process:
  - Detect primary partition failure (30s)
  - Elect new primary from replicas (10s)
  - Update routing table (5s)
  - Resume serving traffic (5s)

total_recovery_time: ~50 seconds
data_loss: None (replicas up-to-date)
**Option B: S3 Restore** (Slow, 5 minutes)
s3_restore_strategy:
  trigger: No replicas available

restore_process:
  - Download partition snapshot from S3 (2 min)
  - Replay WAL since last snapshot (2 min)
  - Rebuild indexes (1 min)
  - Resume serving (30s)

total_recovery_time: ~5 minutes
data_loss: None (WAL replay)
use_case: All replicas failed (rare)
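The two options can be tied together by a simple recovery selector, sketched below; all helper names (`HealthyReplicas`, `PromoteReplica`, `RestoreSnapshotFromS3`, `ReplayWALSinceSnapshot`) are hypothetical.

```go
// Sketch only: pick Option A when a replica exists, otherwise fall back to Option B.
func (rm *RecoveryManager) RecoverPartition(partitionID string) error {
	// Option A: a healthy replica exists → promote it (~50s end to end).
	if replicas := rm.HealthyReplicas(partitionID); len(replicas) > 0 {
		return rm.PromoteReplica(partitionID, replicas[0])
	}

	// Option B: no replicas left → restore the S3 snapshot and replay the WAL (~5 min).
	if err := rm.RestoreSnapshotFromS3(partitionID); err != nil {
		return err
	}
	return rm.ReplayWALSinceSnapshot(partitionID)
}
```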
### 7.3 Cascading Failure Prevention
**Circuit Breaker**:
type CascadePrevent struct {
    failureThreshold float64       // 0.05 = 5% node failures
    actionTimeout    time.Duration
    FailedNodes      int           // cluster state referenced below
    TotalNodes       int
}

func (cp *CascadePrevent) MonitorClusterHealth() {
    failureRate := float64(cp.FailedNodes) / float64(cp.TotalNodes)

    if failureRate > cp.failureThreshold {
        // Stop rebalancing (avoid cascading load)
        cp.PauseRebalancing()

        // Enable load shedding (drop low-priority queries)
        cp.EnableLoadShedding()

        // Alert operations team
        cp.AlertCritical("Cascading failure risk: %.1f%% nodes down",
            failureRate*100)
    }
}
**Backpressure**:
backpressure_policy:
  trigger: partition_unavailable_rate > 10%

  actions:
    - slow_down_writes: 50%
    - reject_bulk_operations: true
    - increase_retry_delays: 2x
    - alert_clients: "degraded_mode"

  recovery:
    condition: partition_unavailable_rate < 5%
    duration: sustained for 5 minutes
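A minimal sketch of how the backpressure policy above could be enforced; the 10%/5% thresholds and actions mirror the YAML, while the controller type and its fields are illustrative assumptions.

```go
import (
	"log"
	"time"
)

type BackpressureController struct {
	WriteRateLimit   float64 // fraction of normal write throughput allowed
	RejectBulkOps    bool
	RetryDelayFactor float64
	belowSince       time.Time // when unavailable rate last dropped below 5%
}

func (bp *BackpressureController) Evaluate(unavailableRate float64, now time.Time) {
	switch {
	case unavailableRate > 0.10:
		// Trigger: partition_unavailable_rate > 10%
		bp.WriteRateLimit, bp.RejectBulkOps, bp.RetryDelayFactor = 0.5, true, 2.0
		bp.belowSince = time.Time{}
		log.Println("backpressure: entering degraded_mode")
	case unavailableRate < 0.05:
		// Recovery: rate < 5% sustained for 5 minutes
		if bp.belowSince.IsZero() {
			bp.belowSince = now
		} else if now.Sub(bp.belowSince) >= 5*time.Minute {
			bp.WriteRateLimit, bp.RejectBulkOps, bp.RetryDelayFactor = 1.0, false, 1.0
			log.Println("backpressure: recovered")
		}
	}
}
```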
### 7.4 Operational Runbook
**Incident: Single Node Failure**
Detection: Heartbeat missed for 30s
Action:
- Check replica availability
- If replicas healthy: Promote replica to primary (50s)
- If no replicas: Restore from S3 (5 min)
- Update monitoring dashboards
- Create JIRA for node replacement
**Incident: AZ Failure (33% of cluster)**
Detection: All nodes in AZ unreachable
Action:
- Verify: AWS service health dashboard
- Failover: Promote replicas in other AZs (2 min)
- Load distribution: Redistribute traffic (5 min)
- Monitoring: Watch for cascading failures
- Capacity: Add temporary nodes if needed
**Incident: Cascading Failure (>10% nodes)**
Detection: Circuit breaker triggered
Action:
- Stop rebalancing immediately
- Enable load shedding (drop 20% lowest priority queries)
- Investigate root cause (memory leak? query storm?)
- Add capacity if needed
- Gradual recovery (don't overwhelm remaining nodes)
### Edit 5: Add Opaque Vertex IDs (Finding 15)

Location: Lines 231-261 (Vertex ID Format section)
Current: Hierarchical format `cluster:proxy:partition:local_id`
Add Discussion:
#### Trade-off: Hierarchical vs Opaque IDs
**Current Design**: Hierarchical (encoded topology)
vertex_id: "02:0045:12:user:alice"
**Pros**:
- Fast routing (parse ID to get partition)
- No routing table lookup needed
- Deterministic placement
**Cons**:
- Rebalancing requires ID rewrite
- Clients depend on internal structure
- Cannot change topology without data migration
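To make the "fast routing" pro above concrete, routing by parsing the hierarchical ID might look like the sketch below; the 4-part split mirrors the `cluster:proxy:partition:local_id` format shown earlier, and the function name is hypothetical.

```go
import (
	"fmt"
	"strings"
)

// RoutePartitionFromID derives the placement of a vertex purely from its ID,
// with no routing-table lookup. Illustrative sketch only.
func RoutePartitionFromID(vertexID string) (cluster, proxy, partition string, err error) {
	parts := strings.SplitN(vertexID, ":", 4) // "02:0045:12:user:alice" → ["02","0045","12","user:alice"]
	if len(parts) < 4 {
		return "", "", "", fmt.Errorf("malformed hierarchical id: %q", vertexID)
	}
	return parts[0], parts[1], parts[2], nil
}
```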
**Alternative**: Opaque IDs with routing table
vertex_id: "v_9f8e7d6c5b4a3918" # UUID routing_table["v_9f8e7d6c5b4a3918"] → partition_id: "02:0045:12"
**Pros**:
- Topology-independent IDs
- Free rebalancing (update routing table only)
- Clients don't depend on internal structure
**Cons**:
- Routing table lookup overhead (~1 μs)
- Routing table storage (~1 GB for 100B vertices)
- Complexity: Distributed routing table updates
**Recommendation for 100B scale**: Use **opaque IDs** with distributed routing table. The flexibility for rebalancing outweighs the 1 μs overhead.
Implementation:
type RoutingTable struct {
    entries map[string]string // vertex_id → partition_id
    cache   *lru.Cache        // Local LRU cache
}

func (rt *RoutingTable) GetPartition(vertexID string) (string, error) {
    // Check local cache first
    if partitionID, ok := rt.cache.Get(vertexID); ok {
        return partitionID.(string), nil
    }
// Fall back to distributed table
partitionID, err := rt.lookupDistributed(vertexID)
if err != nil {
return "", err
}
// Cache result
rt.cache.Add(vertexID, partitionID)
return partitionID, nil
}
## RFC-058: Multi-Level Graph Indexing

### Edit 6: Add Index Tiering for Memory Reconciliation (Finding 5)

Location: Add new Section 6.5 after Section 6.4
Add:
## Section 6.5: Index Tiering Strategy
### Memory Capacity Reconciliation
**Problem**: RFC-057 allocates 30 TB total memory, but naive index storage requires:
Indexes:   16 TB   (11.4% overhead from Section 4.4)
Hot data:  21 TB   (from RFC-059)
Total:     37 TB

Available: 30 TB
Deficit:    7 TB   (19% over capacity)
**Solution**: Tier indexes like data (hot/warm/cold)
### Index Temperature Classification
type IndexTemperature int
const (
    INDEX_HOT  IndexTemperature = iota // In memory
    INDEX_WARM                         // On local SSD
    INDEX_COLD                         // On S3
)

func (im *IndexManager) ClassifyIndexTemperature(
    partitionID string,
) IndexTemperature {
    metrics := im.GetIndexMetrics(partitionID)
switch {
case metrics.QueriesPerMinute > 1000:
return INDEX_HOT // Keep in memory
case metrics.QueriesPerMinute > 10:
return INDEX_WARM // Move to SSD
default:
return INDEX_COLD // Move to S3
}
}
### Memory Allocation (Revised)
memory_budget:
  total_available: 30 TB

  allocation:
    hot_indexes: 1.6 TB    # 10% partitions × 16 GB = 1.6 TB
    hot_data: 15 TB        # 7% of 210 TB (reduced from 10%)
    warm_cache: 5 TB       # In-memory cache for warm tier
    query_buffers: 3 TB    # Query execution memory
    os_overhead: 2 TB      # OS, JVM, buffers
    reserved: 3.4 TB       # Headroom for spikes

  total_used: 26.6 TB
  utilization: 89%
  headroom: 11%
### Index Storage Locations
| Index Tier | Storage | Access Latency | Use Case |
|------------|---------|----------------|----------|
| **Hot** | RAM | 50 μs | Frequently queried partitions (>1000 qpm) |
| **Warm** | SSD | 100 μs | Occasionally queried (10-1000 qpm) |
| **Cold** | S3 | 50-100 ms | Rarely queried (<10 qpm) |
### Index Promotion/Demotion
func (im *IndexManager) ManageIndexTemperature() {
    ticker := time.NewTicker(60 * time.Second)
for range ticker.C {
for _, partition := range im.GetAllPartitions() {
currentTemp := partition.IndexTemperature
targetTemp := im.ClassifyIndexTemperature(partition.ID)
if currentTemp != targetTemp {
im.TransitionIndex(partition.ID, targetTemp)
}
}
}
}
func (im *IndexManager) TransitionIndex(
    partitionID string,
    targetTemp IndexTemperature,
) error {
    partition := im.GetPartition(partitionID)
switch targetTemp {
case INDEX_HOT:
// Load index from SSD/S3 to memory
return im.LoadIndexToMemory(partitionID)
case INDEX_WARM:
// Move index from memory to SSD
if partition.IndexTemperature == INDEX_HOT {
return im.EvictIndexToSSD(partitionID)
}
// Or load from S3 to SSD
return im.LoadIndexToSSD(partitionID)
case INDEX_COLD:
// Evict index to S3
return im.EvictIndexToS3(partitionID)
}
return nil
}
### Performance Trade-offs
| Query Type | Hot Index | Warm Index (SSD) | Cold Index (S3) |
|------------|-----------|------------------|-----------------|
| **Property lookup** | 50 μs | 150 μs (3× slower) | 100 ms (2000× slower) |
| **Range scan** | 2 ms | 10 ms (5× slower) | 2 s (1000× slower) |
**Key Insight**: At 100B scale, treating indexes as tiered resources (like data) is **essential** for memory efficiency.
### Edit 7: Add Index Versioning (Finding 13)

Location: Line 175 (PartitionIndex protobuf)
Current:
message PartitionIndex {
string partition_id = 1;
int64 vertex_count = 2;
int64 edge_count = 3;
int64 last_updated = 4;
...
}
Change To:
message PartitionIndex {
string partition_id = 1;
int32 schema_version = 2; // NEW: Track index format version
int64 vertex_count = 3;
int64 edge_count = 4;
int64 last_updated = 5;
...
}
// Version history:
// v1: Hash + Range indexes only
// v2: + Inverted indexes
// v3: + Edge indexes
// v4: + Bloom filters
// v5: + Super-node optimizations
Add Migration Strategy:
func (il *IndexLoader) LoadIndex(partitionID string) (*PartitionIndex, error) {
data, err := il.ReadIndexFile(partitionID)
if err != nil {
return nil, err
}
index := &PartitionIndex{}
proto.Unmarshal(data, index)
// Check version compatibility
switch index.SchemaVersion {
case 4:
// Current version, use directly
return index, nil
case 3:
// Older version, missing bloom filters
// Upgrade on-the-fly
return il.UpgradeFromV3ToV4(index)
case 2:
// Very old, missing edge indexes
return il.UpgradeFromV2ToV4(index)
default:
return nil, fmt.Errorf("unsupported index version: %d",
index.SchemaVersion)
}
}
## RFC-059: Hot/Cold Storage Tiers

### Edit 8: Add S3 Request Cost Optimization (Finding 1)

Location: After Section 3 (Cost Analysis)
Add New Section 8.5:
## Section 8.5: S3 Request Cost Optimization
### The Hidden Cost of S3
**Problem**: Initial cost analysis (Section 3) only considered storage costs:
Storage: 189 TB × $23/TB = $4,347/month
**Reality**: S3 request costs dominate at scale:
Query load:      1B queries/sec (from RFC-060)
Cache miss rate: 10% miss × 90% cold = 9%
S3 GETs/sec:     90M
Cost:            90M × ($0.0004 / 1,000) = $36/second = $93.3M/month

Total: $93.3M/month including storage and requests (159× higher than the stated figure)
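The request-cost arithmetic above can be checked with a few lines of code; all inputs are the numbers already quoted in this section, nothing new is introduced.

```go
// Back-of-the-envelope check of the S3 GET cost math above.
func monthlyS3GetCostUSD(queriesPerSec float64) float64 {
	const (
		cacheMissRate   = 0.10   // 10% cache miss
		coldFraction    = 0.90   // 90% of data in the cold tier
		costPerThousand = 0.0004 // S3 GET pricing, $ per 1,000 requests
		secondsPerMonth = 30 * 24 * 3600.0
	)
	s3GetsPerSec := queriesPerSec * cacheMissRate * coldFraction // 90M at 1B qps
	return s3GetsPerSec / 1000 * costPerThousand * secondsPerMonth
}

// monthlyS3GetCostUSD(1e9) ≈ $93.3M/month
```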
### Multi-Tier Caching Architecture
**Tier 0: Proxy-Local Cache** (New)
varnish_cache:
  size_per_proxy: 100 GB (local SSD)
  ttl: 3600s (1 hour)
  eviction: LRU

effect:
  cache_hit_rate: 95% (up from 90%)
  s3_requests: 50M/sec (down from 90M)
**Tier 1: CloudFront CDN**
cloudfront_config:
  price_class: PriceClass_100 (US/Europe/Asia)
  request_cost: $0.0075 per 10K requests (85× cheaper than S3)

effect:
  cache_hit_rate: 98% at CDN level
  s3_requests: 1M/sec (down from 50M)
  monthly_cost: $2.6M CDN + $1.04M S3 = $3.64M
**Tier 2: S3 Express One Zone** (for hot partitions)
s3_express_config:
  storage_cost: $0.16/GB-month (vs $0.023 Standard)
  latency: 5-10ms (vs 50-100ms Standard)
  use_case: Top 10% hottest partitions (21 TB)

trade_off:
  storage_cost: 21 TB × $160/TB = $3,360/month (7× higher)
  benefit: 10× lower latency for hot queries
**Tier 3: Batch S3 Reads**
// Instead of: 1 S3 GET per vertex (expensive)
// Do: Batch reads in 100 MB chunks (amortize cost)

func (pm *PartitionManager) BatchLoadFromS3(
    vertexIDs []string,
) ([]*Vertex, error) {
    // Group by S3 object (partitions are 100 MB files)
    objectGroups := pm.GroupByS3Object(vertexIDs)

    result := []*Vertex{}
    for object, ids := range objectGroups {
        // Single S3 GET for entire 100 MB object
        data := pm.S3Client.GetObject(object)

        // Extract only needed vertices
        vertices := pm.ExtractVertices(data, ids)
        result = append(result, vertices...)

        // Cache entire object for subsequent reads
        pm.CacheObject(object, data)
    }

    // Cost: 1 S3 GET for 1000 vertices (vs 1000 GETs)
    // Savings: 1000× reduction in request cost
    return result, nil
}
### Revised Cost Model
| Component | Monthly Cost | Notes |
|-----------|--------------|-------|
| Hot Tier RAM | $583,000 | 1000 proxies |
| Cold Tier Storage | $4,347 | 189 TB S3 |
| **Proxy-Local Cache** | $0 | Included in proxy cost |
| **CloudFront CDN** | $2,600,000 | 98% cache hit |
| **S3 GET Requests** | $1,040,000 | 2% miss after CDN |
| Cross-AZ Bandwidth | $500,000 | From RFC-057 edits |
| Operational | $150,000 | Monitoring, logging |
| **Total** | **$4,877,347/month** | **$58.5M/year** |
**Key Insight**: True cost is **8× higher** than initial estimate, but still **44% cheaper** than pure in-memory baseline ($105M).
### Cost Optimization Roadmap
| Scale | Without CDN | With CDN | Savings |
|-------|-------------|----------|---------|
| 1B vertices | $58k/month | $58k/month | 0% (CDN not needed) |
| 10B vertices | $15M/month | $1.2M/month | 92% |
| 100B vertices | $150M/month | $4.9M/month | 97% |
**Recommendation**: Deploy CloudFront CDN **from day one** at 10B scale. Architecture doesn't work economically at 100B without it.
### Edit 9: Add Snapshot Loading with WAL Replay (Finding 12)

Location: Add new Section 9.3 after Section 9.2
Add:
## Section 9.3: Snapshot Loading with WAL Replay
### The Version Skew Problem
Loading a 210 TB snapshot takes 17 minutes. **What happens to writes during the load?**
**Naive Approach** (broken):
T0:  Start loading snapshot-2025-11-15
T5:  New writes arrive → Where do they go?
T17: Load complete → 17 minutes of writes lost!
### Solution: Dual-Version Loading
func (sl *SnapshotLoader) LoadWithWALReplay(snapshotPath string) error {
    // Step 1: Create shadow graph
    shadowGraph := sl.CreateShadowGraph()
// Step 2: Load snapshot into shadow (17 min)
snapshotTimestamp := sl.GetSnapshotTimestamp(snapshotPath)
sl.LoadSnapshot(snapshotPath, shadowGraph)
// Step 3: Replay WAL entries since snapshot
walEntries := sl.GetWALSince(snapshotTimestamp)
log.Infof("Replaying %d WAL entries from last 17 minutes",
len(walEntries))
for _, entry := range walEntries {
shadowGraph.Apply(entry)
}
// Step 4: Atomic switch
sl.mu.Lock()
oldGraph := sl.activeGraph
sl.activeGraph = shadowGraph
sl.mu.Unlock()
// Step 5: Cleanup old graph
oldGraph.Shutdown()
log.Infof("Snapshot load complete, switched to new graph")
return nil
}
### Query Behavior During Load
func (qe *QueryExecutor) ExecuteQuery(query *Query) (*Result, error) {
    // Queries always go to active graph
    qe.mu.RLock()
    graph := qe.activeGraph
    qe.mu.RUnlock()
return graph.Execute(query)
}
**Timeline**:
T0:      Start load (old graph serves queries)
T17:     Load complete + WAL replay (old graph still serving)
T17.001: Atomic switch (new graph serves queries)
T17.1:   Cleanup old graph

Downtime: 0 seconds (seamless transition)
### WAL Replay Performance
Scenario:     17 minutes of writes
Write rate:   100k writes/sec
Total writes: 17 × 60 × 100k = 102M writes
Replay time:  102M / 1M per sec = 102 seconds

Total load time: 17 min (snapshot) + 1.7 min (WAL) = 18.7 minutes
### Consistency Guarantees
Before switch: All queries see old graph (consistent)
After switch:  All queries see new graph (consistent)
No queries see partial state (atomic switch)
Writes during load:
- Go to old graph (via WAL)
- Replayed on new graph
- No writes lost
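A minimal sketch of the corresponding write path during a load: every mutation is appended to the WAL (so the shadow graph can replay it) and applied to whichever graph is currently active. `WALEntry`, `sl.wal`, `Append`, and the use of an RWMutex are assumptions; only `activeGraph` and `mu` appear in the code above.

```go
// Sketch only: write path while a snapshot load is in progress.
func (sl *SnapshotLoader) ApplyWrite(entry *WALEntry) error {
	// Durable first: the WAL is what the shadow graph replays after the load.
	if err := sl.wal.Append(entry); err != nil {
		return err
	}

	// Serve the write from whichever graph is currently active (the old graph
	// during a load); the shadow graph picks it up via GetWALSince().
	sl.mu.RLock()
	defer sl.mu.RUnlock()
	return sl.activeGraph.Apply(entry)
}
```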
### Edit 10: Add Hysteresis for Temperature Management (Finding 8)

Location: Lines 273-289 (temperature rules)
Current:
temperature_rules:
hot: ">= 1000 requests/min"
warm: ">= 10 requests/min"
cold: "< 10 requests/min"
Change To:
temperature_rules:
hot:
promote_threshold: 1200 # Higher to promote (20% hysteresis)
demote_threshold: 800 # Lower to demote
cooldown_period: 300s # Wait 5 min before demotion
warm:
promote_threshold: 1000 # From cold to warm
demote_threshold: 5 # From warm to cold
cooldown_period: 600s # Wait 10 min
cold:
# No thresholds (bottom tier)
rationale: |
Hysteresis prevents thrashing for partitions near thresholds.
Example: Partition at 990-1010 req/min won't flip every minute.
Cooldown period ensures sustained temperature change before action.
example:
partition_requests: [990, 1010, 980, 1020, 1150, 1180, 1200, 1210]
without_hysteresis:
state_changes: [warm, hot, warm, hot, hot, hot, hot, hot]
promotions: 4 (expensive!)
with_hysteresis:
state_changes: [warm, warm, warm, warm, warm, warm, hot, hot]
promotions: 1 (correct behavior)
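A minimal sketch of the hot↔warm transition with hysteresis and a cooldown, using the thresholds above (promote at 1200 req/min, demote at 800 req/min only after 300s); the type and method names are illustrative, not part of RFC-059.

```go
import "time"

type HysteresisState struct {
	Hot        bool
	belowSince time.Time // first observation below the demote threshold
}

func (h *HysteresisState) Observe(reqPerMin float64, now time.Time) {
	switch {
	case !h.Hot && reqPerMin >= 1200:
		h.Hot, h.belowSince = true, time.Time{} // promote immediately
	case h.Hot && reqPerMin <= 800:
		if h.belowSince.IsZero() {
			h.belowSince = now // start the 300s cooldown clock
		} else if now.Sub(h.belowSince) >= 300*time.Second {
			h.Hot = false // demote only after a sustained cool-down
		}
	case h.Hot:
		h.belowSince = time.Time{} // back above 800: reset the cooldown
	}
}
```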
## RFC-060: Distributed Gremlin Execution

### Edit 11: Add Query Resource Limits (Finding 4)

Location: Add new Section 7 after Section 6
Add:
## Section 7: Query Resource Limits and Runaway Prevention
At 1000 nodes, a single bad query can exhaust cluster resources. Multi-layer limits are critical.
### 7.1 Query Configuration Limits
query_resource_limits:
  # Timing limits
  default_timeout_seconds: 30
  max_timeout_seconds: 300
  warning_threshold_seconds: 10

  # Memory limits
  max_memory_per_query: 1_073_741_824   # 1 GB
  max_intermediate_results: 10_000_000  # 10M vertices

  # Result limits
  max_result_vertices: 1_000_000        # Hard cap
  max_fanout_per_hop: 10_000            # Max edges per vertex
  max_traversal_depth: 5                # Max hops

  # Parallelism limits
  max_partitions_per_query: 1000
  max_concurrent_rpcs: 100

  # Cost limits (multi-tenant)
  max_cost_units: 1000
  cost_per_vertex_scan: 1
  cost_per_edge_traversal: 2
  cost_per_partition_access: 10
### 7.2 Pre-Execution Complexity Analysis
func (qca *QueryComplexityAnalyzer) AnalyzeBeforeExecution(
    query *GremlinQuery,
) (*ComplexityEstimate, error) {
    estimate := &ComplexityEstimate{}
// Estimate vertices scanned
estimate.VerticesScanned = qca.EstimateVertexScan(query)
if estimate.VerticesScanned > qca.limits.MaxIntermediateResults {
return nil, ErrQueryTooComplex{
Reason: fmt.Sprintf("Estimated %d vertices exceeds limit",
estimate.VerticesScanned),
Suggestion: "Add more selective filters before traversal",
}
}
// Estimate traversal depth
estimate.MaxDepth = qca.EstimateTraversalDepth(query)
if estimate.MaxDepth > qca.limits.MaxTraversalDepth {
return nil, ErrQueryTooComplex{
Reason: fmt.Sprintf("Traversal depth %d exceeds limit",
estimate.MaxDepth),
Suggestion: "Reduce number of .out()/.in() steps",
}
}
// Calculate cost units
estimate.CostUnits = qca.CalculateCost(estimate)
if estimate.CostUnits > qca.limits.MaxCostUnits {
return nil, ErrQueryTooExpensive{
Cost: estimate.CostUnits,
Limit: qca.limits.MaxCostUnits,
}
}
return estimate, nil
}
### 7.3 Runtime Enforcement
func (qe *QueryExecutor) ExecuteWithLimits(
    ctx context.Context,
    query *GremlinQuery,
    limits *QueryResourceLimits,
) (*ResultStream, error) {
    // Step 1: Pre-execution check
    estimate, err := qe.complexityAnalyzer.AnalyzeBeforeExecution(query)
    if err != nil {
        return nil, err
    }
// Step 2: Timeout context
timeoutCtx, cancel := context.WithTimeout(ctx,
time.Duration(limits.DefaultTimeoutSeconds) * time.Second)
defer cancel()
// Step 3: Memory tracking
memTracker := NewMemoryTracker(limits.MaxMemoryPerQuery)
defer memTracker.Release()
// Step 4: Execute with checks
resultStream := NewResultStream()
go func() {
defer resultStream.Close()
vertexCount := 0
for stage := range qe.ExecuteStages(timeoutCtx, query) {
// Check timeout
select {
case <-timeoutCtx.Done():
resultStream.SendError(ErrQueryTimeout{
Duration: time.Since(query.StartTime),
})
return
default:
}
// Check memory
if memTracker.CurrentUsage() > limits.MaxMemoryPerQuery {
resultStream.SendError(ErrMemoryLimitExceeded{
Usage: memTracker.CurrentUsage(),
})
return
}
// Check vertex count
vertexCount += len(stage.Results)
if vertexCount > limits.MaxResultVertices {
resultStream.SendWarning(WarnResultTruncated{
Returned: limits.MaxResultVertices,
Actual: vertexCount,
})
return
}
// Stream results
for _, vertex := range stage.Results {
resultStream.Send(vertex)
}
}
}()
return resultStream, nil
}
### 7.4 Circuit Breaker
type CircuitBreaker struct {
    mu               sync.Mutex
    failureThreshold float64       // 0.1 = 10% failure rate
    recoveryTimeout  time.Duration // 30 seconds
    state            CircuitState
    failures         int
    successes        int
    lastStateChange  time.Time
}

func (cb *CircuitBreaker) RecordFailure() {
    cb.mu.Lock()
    defer cb.mu.Unlock()
cb.failures++
total := cb.failures + cb.successes
if total > 100 {
failureRate := float64(cb.failures) / float64(total)
if failureRate > cb.failureThreshold {
cb.state = CircuitOpen
cb.lastStateChange = time.Now()
log.Errorf("Circuit breaker opened: %.1f%% failure rate",
failureRate*100)
}
}
}
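The struct above declares `recoveryTimeout`, which the admission side would use; a sketch of that check follows, including a half-open trial state. The `CircuitHalfOpen` and `CircuitClosed` constants are assumptions layered on the `CircuitState` type used above.

```go
// Sketch only: request-admission side of the breaker.
func (cb *CircuitBreaker) Allow() bool {
	cb.mu.Lock()
	defer cb.mu.Unlock()

	switch cb.state {
	case CircuitOpen:
		// After recoveryTimeout, let a single trial request through (half-open).
		if time.Since(cb.lastStateChange) > cb.recoveryTimeout {
			cb.state = CircuitHalfOpen
			cb.lastStateChange = time.Now()
			return true
		}
		return false
	case CircuitHalfOpen:
		return false // only the one trial request is in flight
	default: // CircuitClosed
		return true
	}
}
```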
### 7.5 Query Admission Control
type QueryAdmissionController struct {
    maxConcurrentQueries int
    currentQueries       int64
    priorityLevels       map[string]int
}

func (qac *QueryAdmissionController) AdmitQuery(
    query *Query,
    principal *Principal,
) error {
    current := atomic.LoadInt64(&qac.currentQueries)
if current >= int64(qac.maxConcurrentQueries) {
priority := qac.priorityLevels[principal.TenantID]
if priority < 5 {
return ErrClusterOverloaded{
CurrentLoad: current,
RetryAfter: 5 * time.Second,
}
}
}
atomic.AddInt64(&qac.currentQueries, 1)
defer atomic.AddInt64(&qac.currentQueries, -1)
return nil
}
### Edit 12: Add Super-Node Handling (Finding 2)

Location: Add new Section 6 after Section 5
Add:
## Section 6: Power-Law Graph Distribution and Super-Node Handling
Real-world graphs follow a power-law degree distribution: the top 0.01% of vertices (celebrities, hub accounts) have 1M+ edges.
### 6.1 Super-Node Classification
type VertexType int
const (
    VERTEX_TYPE_NORMAL     VertexType = iota // <10k edges (99%)
    VERTEX_TYPE_HUB_NODE                     // 10k-100k edges (1%)
    VERTEX_TYPE_SUPER_NODE                   // 100k-1M edges (0.01%)
    VERTEX_TYPE_MEGA_NODE                    // >1M edges (0.001%)
)

func (pm *PartitionManager) ClassifyVertex(vertexID string) VertexType {
    degree := pm.GetVertexDegree(vertexID)
switch {
case degree > 1_000_000:
return VERTEX_TYPE_MEGA_NODE
case degree > 100_000:
return VERTEX_TYPE_SUPER_NODE
case degree > 10_000:
return VERTEX_TYPE_HUB_NODE
default:
return VERTEX_TYPE_NORMAL
}
}
### 6.2 Sampling Strategy
func (qe *QueryExecutor) TraverseWithSampling(
    vertexID string,
    edgeLabel string,
    maxResults int,
) ([]*Vertex, *SamplingMetadata, error) {
    degree := qe.GetVertexDegree(vertexID)
if degree <= 10_000 {
// Normal: Return all edges
edges := qe.GetAllEdges(vertexID, edgeLabel)
return edges, &SamplingMetadata{Sampled: false}, nil
}
if degree > 1_000_000 {
// Mega-node: Use HyperLogLog + random sample
cardinality := qe.EstimateCardinality(vertexID, edgeLabel)
sample := qe.RandomSample(vertexID, edgeLabel, maxResults)
return sample, &SamplingMetadata{
Sampled: true,
ActualDegree: cardinality,
Method: "random_sample",
}, nil
}
// Hub node: Return top-K by weight
topK := qe.GetTopKEdges(vertexID, edgeLabel, maxResults)
return topK, &SamplingMetadata{
Sampled: true,
ActualDegree: degree,
Method: "top_k",
}, nil
}
### 6.3 Gremlin Extensions for Approximation
// New: .approximate() step
g.V('user:taylor_swift')
  .in('FOLLOWS')
  .approximate()    // Enable sampling/approximation
  .count()
// Returns: ~100M (HyperLogLog estimate)

// New: .sample(N) step
g.V('user:taylor_swift')
  .in('FOLLOWS')
  .sample(10000)    // Random sample of 10k followers
  .has('country', 'USA')
  .count()
// Returns: 6,543 USA followers in sample
// Extrapolate: ~65M total USA followers
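The extrapolation in the comment above is plain sample scaling; a worked check, with hypothetical function and parameter names:

```go
// ExtrapolateFromSample scales a sampled match count by the estimated total degree.
func ExtrapolateFromSample(sampleMatches, sampleSize int, estimatedDegree int64) int64 {
	return int64(float64(sampleMatches) / float64(sampleSize) * float64(estimatedDegree))
}

// ExtrapolateFromSample(6543, 10000, 100_000_000) ≈ 65,430,000 (~65M USA followers)
```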
### 6.4 Circuit Breaker for Super-Nodes
func (qe *QueryExecutor) ExecuteWithCircuitBreaker(
    query *Query,
) (*Result, error) {
    // Check vertex degree before execution
    startVertex := qe.ExtractStartVertex(query)
    degree := qe.GetVertexDegree(startVertex)
if degree > 1_000_000 {
// Require explicit .approximate() or .sample() step
if !query.HasApproximateMode() {
return nil, ErrSuperNodeQuery{
VertexID: startVertex,
Degree: degree,
Suggestion: "Use .approximate() or .sample(N)",
}
}
}
return qe.Execute(query)
}
### Edit 13: Add Query Observability (Finding 11)

Location: Add new Section 10 after Section 9
Add:
## Section 10: Query Observability and Debugging
### 10.1 EXPLAIN Plan (SQL-style)
g.V().has('city', 'SF').out('FOLLOWS').values('name').explain()
// Output:
Step 1: IndexScan (city='SF')
  Partitions: 150 of 16,000 (99.1% pruned)
  Estimated vertices: 5M
  Cost: 2s (parallel scan)

Step 2: EdgeTraversal (out:FOLLOWS)
  Fan-out: 50M edges
  Partitions: 5,000 (crossing boundaries)
  Cost: 10s (network RPC)

Step 3: PropertyExtraction (values:name)
  Memory tier: 60% (warm)
  S3 fetches: 2M (cold tier)
  Cost: 5s (S3 latency)

Total estimated cost: 17s
### 10.2 Query Timeline Visualization
[==IndexScan:2.1s==][====EdgeTraversal:12.3s====][=PropExtract:4.8s=]
                     ^ Slow step (S3 fetches)
### 10.3 Distributed Tracing (OpenTelemetry)
func (qe *QueryExecutor) ExecuteWithTracing(
    ctx context.Context,
    query *GremlinQuery,
) (*Result, error) {
    // Create trace span
    tracer := otel.Tracer("query-executor")
    ctx, span := tracer.Start(ctx, "execute_gremlin_query")
    defer span.End()
// Add query attributes
span.SetAttributes(
attribute.String("query", query.Text),
attribute.Int64("estimated_vertices", query.EstimatedVertices),
)
    // Execute stages with child spans
    var result *Result
    for _, stage := range query.Stages {
        stageCtx, stageSpan := tracer.Start(ctx,
            fmt.Sprintf("stage_%d", stage.Index))

        result = qe.ExecuteStage(stageCtx, stage)

        stageSpan.SetAttributes(
            attribute.Int64("vertices_processed", result.Count),
            attribute.Int64("partitions_queried", int64(len(stage.Partitions))),
        )
        stageSpan.End()
    }

    return result, nil
}
### 10.4 Slow Query Log
slow_query_log:
  enabled: true
  threshold_seconds: 10
  destination: elasticsearch

  fields:
    - query_text
    - query_duration_ms
    - vertices_scanned
    - partitions_queried
    - principal_id
    - timestamp

example_entry:
  query: "g.V().out().out().out().count()"
  duration_ms: 45230
  vertices_scanned: 125000000
  partitions_queried: 8432
  principal: "user@company.com"
  timestamp: "2025-11-15T10:30:00Z"
  diagnosis: "Too many hops without filters"
### 10.5 Metrics and Alerts
prometheus_metrics:
  - name: query_latency_seconds
    type: histogram
    labels: [query_type, complexity, tenant_id]
    buckets: [0.001, 0.01, 0.1, 1, 10, 30, 60, 300]

  - name: query_timeouts_total
    type: counter
    labels: [query_type, reason, tenant_id]

  - name: circuit_breaker_state
    type: gauge
    labels: [cluster_id]

alerts:
  - name: HighQueryTimeoutRate
    condition: rate(query_timeouts_total[5m]) > 0.1
    severity: warning
    action: "Investigate slow queries"

  - name: CircuitBreakerOpen
    condition: circuit_breaker_state == 1
    severity: critical
    action: "Cluster instability detected"
## RFC-061: Graph Authorization

### Edit 14: Add Batch Authorization Optimization (Finding 10)

Location: Add new Section 7.5 after Section 7.4
Add:
## Section 7.5: Batch Authorization Optimization
### The Performance Problem
Per-vertex authorization check (10 μs) × 1M vertex query = **10 seconds overhead** (unacceptable).
// Naive approach (slow):
for _, vertex := range results {
    if !authz.CanAccessVertex(principal, vertex) {
        continue // Skip unauthorized
    }
    authorizedResults = append(authorizedResults, vertex)
}

// Cost: 1M vertices × 10 μs = 10 seconds
### Solution: Bitmap-Based Batch Authorization
type ClearanceBitmapCache struct {
    // principal_id → partition_id → authorized (bitmap)
    cache map[string]*roaring.Bitmap
}

func (cbc *ClearanceBitmapCache) GetAuthorizedPartitions(
    principalID string,
) *roaring.Bitmap {
    // Check cache
    if bitmap, exists := cbc.cache[principalID]; exists {
        return bitmap
    }
// Compute bitmap
principal := cbc.LoadPrincipal(principalID)
bitmap := roaring.New()
for _, partition := range cbc.GetAllPartitions() {
if cbc.IsPartitionAuthorized(partition, principal) {
bitmap.Add(uint32(partition.ID))
}
}
// Cache result
cbc.cache[principalID] = bitmap
return bitmap
}
func (qe *QueryExecutor) ExecuteWithBatchAuthz(
    principal *Principal,
    partitions []string,
) ([]*Vertex, error) {
    // Get authorized partitions bitmap (computed once)
    authorizedPartitions := qe.bitmapCache.GetAuthorizedPartitions(
        principal.ID)
// Filter partitions (fast bitmap check)
filteredPartitions := []string{}
for _, partitionID := range partitions {
    // Assumes numeric partition IDs; parse before the bitmap membership test
    idx, _ := strconv.ParseUint(partitionID, 10, 32)
    if authorizedPartitions.Contains(uint32(idx)) {
        filteredPartitions = append(filteredPartitions, partitionID)
    }
}
// Query only authorized partitions
return qe.QueryPartitions(filteredPartitions)
}
// Cost: 1 bitmap computation + 1M bit tests
// Time: 100 μs (compute) + 1M × 1 ns (test) = 1.1 ms
// Speedup: 10s → 1ms = 10,000× faster!
### Partition-Level Authorization Filter
func (qp *QueryPlanner) PrunePartitionsWithAuthz(
    filter *Filter,
    principal *Principal,
) []string {
    // Step 1: Index-based pruning
    candidatePartitions := qp.PrunePartitionsWithIndex(filter)
// Step 2: Authorization pruning (bitmap)
authorizedBitmap := qp.bitmapCache.GetAuthorizedPartitions(
principal.ID)
// Step 3: Intersect
authorizedPartitions := []string{}
for _, partitionID := range candidatePartitions {
    // Assumes numeric partition IDs; parse before the bitmap membership test
    idx, _ := strconv.ParseUint(partitionID, 10, 32)
    if authorizedBitmap.Contains(uint32(idx)) {
        authorizedPartitions = append(authorizedPartitions, partitionID)
    }
}
return authorizedPartitions
}
### Performance Comparison
| Query Size | Per-Vertex Authz | Bitmap Authz | Speedup |
|------------|------------------|--------------|---------|
| **1k vertices** | 10 ms | 1 ms | 10× |
| **10k vertices** | 100 ms | 1 ms | 100× |
| **100k vertices** | 1 s | 1.1 ms | 909× |
| **1M vertices** | 10 s | 1.1 ms | 9,090× |
### Cache Invalidation
func (cbc *ClearanceBitmapCache) InvalidateOnPrincipalUpdate(
    principalID string,
) {
    // Clear cached bitmap when principal clearances change
    delete(cbc.cache, principalID)
}

func (cbc *ClearanceBitmapCache) InvalidateOnPartitionLabelChange(
    partitionID string,
) {
    // Clear all cached bitmaps (expensive but rare)
    cbc.cache = make(map[string]*roaring.Bitmap)
}
**Key Insight**: At 100B scale, per-vertex authorization checks are infeasible. Bitmap-based batch authorization is **mandatory** for acceptable performance.
### Edit 15: Add Audit Log Sampling (Finding 14)

Location: Lines 863-870 (Audit Log Throughput)
Add After Table:
### Audit Log Sampling Strategy
**Problem**: 388 TB for 90 days of audit logs is excessive (and expensive).
**Solution**: Sample non-sensitive queries, always log sensitive/denied:
audit_sampling_config:
  # Always log (100%)
  always_log:
    - event_type: access_denied
    - event_type: sensitive_access
    - resource_labels: [pii, financial, secret]
    - principal_types: [admin, root, service_account]

  # Sample normal queries (1%)
  sample:
    rate: 0.01              # 1% of normal queries
    method: consistent_hash # Same query_id always sampled/not

    # Enhanced logging for sampled queries
    include_query_plan: true
    include_timing_breakdown: true
**Implementation**:
func (al *AuditLogger) ShouldLog(event *AuditEvent) bool {
    // Always log denials
    if event.Decision == ACCESS_DECISION_DENY {
        return true
    }
// Always log sensitive access
if al.IsSensitiveAccess(event) {
return true
}
// Always log privileged principals
if al.IsPrivilegedPrincipal(event.PrincipalID) {
return true
}
// Sample normal queries (1%)
hash := xxhash.Sum64String(event.QueryID)
return (hash % 100) < 1 // 1% sample rate
}
**Cost Savings**:
Original:          388 TB for 90 days
With 1% sampling:  3.88 TB (normal) + 10 TB (sensitive/denied) = 13.88 TB
Savings:           96% reduction

Storage cost: 13.88 TB × $23/TB = $319/month (vs $8,924 original)
**Trade-offs**:
- **Pro**: 96% cost savings, focus on important events
- **Con**: Can't reconstruct all query history (only sample)
- **Acceptable**: Compliance requirements met (all sensitive/denied logged)
## Summary
This document defines 15 specific edits across 5 RFCs to address the 18 findings from MEMO-050:
| RFC | Edits | Critical Fixes | High Priority | Medium Priority |
|---|---|---|---|---|
| RFC-057 | 5 edits | Network topology, Opaque IDs | Partition sizing, Hash function, Failure recovery | - |
| RFC-058 | 2 edits | Memory reconciliation | - | Index versioning |
| RFC-059 | 3 edits | S3 cost optimization | - | WAL replay, Hysteresis |
| RFC-060 | 3 edits | Query limits, Super-node handling | - | Observability |
| RFC-061 | 2 edits | - | Batch authorization | Audit sampling |
Estimated Effort: 15-20 engineer-days to implement all edits with proper testing and validation.
Priority Order:
- Week 1: Critical P0 edits (cost model, memory, limits)
- Week 2: High priority P1 edits (performance, reliability)
- Week 3: Medium priority P2 edits (operations, polish)
Next Steps:
- Review this summary with architecture team
- Assign edits to RFC owners
- Schedule RFC update sprint
- Update RFC status to "Draft - Revisions In Progress"
- Final review after all edits complete