Skip to main content

RFC Edit Summary for MEMO-050 Production Readiness

Parent Document: MEMO-050: Production Readiness Analysis Date: 2025-11-15 Author: Platform Team

This document summarizes all required edits to RFCs 057-061 based on the production readiness analysis in MEMO-050.


RFC-057: Massive-Scale Graph Sharding

Edit 1: Update Partition Sizing (Finding 6)

Location: Line 269 Current: partitions_per_proxy: 16 Change To:

partitions_per_proxy: 64  # Increased from 16 for finer granularity
vertices_per_partition: 1_562_500 # Down from 6.25M
partition_size_mb: 156 # Down from 625 MB

rationale: |
- Finer hot/cold temperature management
- Faster rebalancing (156 MB chunks vs 625 MB)
- Better load distribution (lower variance)
- Smaller failure blast radius

Edit 2: Replace CRC32 with xxHash (Finding 7)

Location: Lines 290-300 Current:

crc32("user:alice") % 10 → cluster_id

Change To:

import "github.com/cespare/xxhash/v2"

func (p *Partitioner) GetClusterID(vertexID string) int {
h := xxhash.Sum64String(vertexID)
return int(h % uint64(p.NumClusters))
}

// Benefits:
// - 1.7× faster than CRC32
// - 8× better distribution (1.8% variance vs 15%)
// - Industry standard (RocksDB, ClickHouse)

Edit 3: Add Network Topology Awareness (Finding 3)

Location: After Line 227 (end of Section 4) Add New Section 4.6:

### Section 4.6: Network Topology-Aware Sharding

#### Partition Metadata Extension

message PartitionMetadata { string partition_id = 1; string cluster_id = 2; string proxy_id = 3;

// Network topology (NEW) NetworkLocation location = 4; }

message NetworkLocation { string region = 1; // "us-west-2" string availability_zone = 2; // "us-west-2a" string rack_id = 3; // "rack-42" (optional)

// Network characteristics float cross_az_bandwidth_gbps = 4; float cross_region_bandwidth_gbps = 5; }


#### Network Cost Model

Cross-AZ bandwidth costs **$0.01/GB**, which dominates at scale:

| Network Path | Latency | Cost | Priority |
|--------------|---------|------|----------|
| Same rack | 100 μs | $0/GB | Highest |
| Same AZ | 200 μs | $0/GB | High |
| Cross-AZ | 1-2 ms | $0.01/GB | Medium |
| Cross-region | 20-100 ms | $0.02/GB | Low |

**Cost Impact**:
- Without optimization: $600M/year in cross-AZ bandwidth
- With AZ-aware placement: $6M/year (5% cross-AZ traffic)

#### Locality-Aware Placement Strategy

func (nap *NetworkAwarePartitioner) AssignPartition( vertex *Vertex, placementHint *PlacementHint, ) string { // If hint provided, prefer co-location in same AZ if placementHint != nil && placementHint.CoLocateWithVertex != "" { targetPartition := nap.GetPartitionForVertex(placementHint.CoLocateWithVertex) targetAZ := nap.GetAZ(targetPartition)

    // Find partition in same AZ
sameAZPartition := nap.FindPartitionInAZ(targetAZ)
if sameAZPartition != "" {
return sameAZPartition
}
}

// Fall back to consistent hashing
return nap.ConsistentHash(vertex.ID)

}


#### Multi-AZ Deployment Strategy

cluster_topology: region: us-west-2

availability_zones: - id: us-west-2a proxies: 334 primary_for: [cluster_0, cluster_3, cluster_6, cluster_9]

- id: us-west-2b
proxies: 333
primary_for: [cluster_1, cluster_4, cluster_7]

- id: us-west-2c
proxies: 333
primary_for: [cluster_2, cluster_5, cluster_8]

partition_replication: factor: 3 # 3 replicas per partition placement: different_az # Replicas in different AZs read_preference: primary_az # Read from local AZ first


#### Cost Savings

| Scale | Without AZ Awareness | With AZ Awareness | Savings |
|-------|---------------------|-------------------|---------|
| 10B vertices | $45M/year | $15M/year | 67% |
| 100B vertices | $600M/year | $30M/year | 95% |

**Recommendation**: Network-aware placement is **mandatory** at 100B scale.

Edit 4: Add Failure Detection and Recovery (Finding 9)

Location: Add new Section 7 after Section 6 Add:

## Section 7: Failure Detection and Recovery

At 1000 nodes, expect **~1 node failure per day** (typical MTBF). Comprehensive failure handling is critical.

### 7.1 Node Failure Detection

**Heartbeat Protocol**:

heartbeat_config: interval: 10s # Send heartbeat every 10 seconds timeout: 30s # Mark dead after 3 missed heartbeats grace_period: 60s # Wait before triggering recovery

detection_method: type: phi_accrual # Φ accrual failure detector threshold: 8.0 # Confidence level for failure


**Gossip Protocol**:

Each node maintains cluster membership:

  • Gossip interval: 1 second
  • Gossip fanout: 3 random nodes
  • Suspicion threshold: 3 nodes report failure

### 7.2 Partition Recovery Options

**Option A: Replica Failover** (Fast, 10 seconds)

replication_strategy: replicas_per_partition: 3 placement: different_az consistency: eventual

failover_process:

  1. Detect primary partition failure (30s)
  2. Elect new primary from replicas (10s)
  3. Update routing table (5s)
  4. Resume serving traffic (5s)

total_recovery_time: ~50 seconds data_loss: None (replicas up-to-date)


**Option B: S3 Restore** (Slow, 5 minutes)

s3_restore_strategy: trigger: No replicas available

restore_process: 1. Download partition snapshot from S3 (2 min) 2. Replay WAL since last snapshot (2 min) 3. Rebuild indexes (1 min) 4. Resume serving (30s)

total_recovery_time: ~5 minutes data_loss: None (WAL replay) use_case: All replicas failed (rare)


### 7.3 Cascading Failure Prevention

**Circuit Breaker**:

type CascadePrevent struct { failureThreshold float64 // 0.05 = 5% node failures actionTimeout time.Duration }

func (cp *CascadePrevent) MonitorClusterHealth() { failureRate := float64(cp.FailedNodes) / float64(cp.TotalNodes)

if failureRate > cp.failureThreshold {
// Stop rebalancing (avoid cascading load)
cp.PauseRebalancing()

// Enable load shedding (drop low-priority queries)
cp.EnableLoadShedding()

// Alert operations team
cp.AlertCritical("Cascading failure risk: %.1f%% nodes down",
failureRate*100)
}

}


**Backpressure**:

backpressure_policy: trigger: partition_unavailable_rate > 10%

actions: - slow_down_writes: 50% - reject_bulk_operations: true - increase_retry_delays: 2x - alert_clients: "degraded_mode"

recovery: condition: partition_unavailable_rate < 5% duration: sustained for 5 minutes


### 7.4 Operational Runbook

**Incident: Single Node Failure**

Detection: Heartbeat missed for 30s Action:

  1. Check replica availability
  2. If replicas healthy: Promote replica to primary (50s)
  3. If no replicas: Restore from S3 (5 min)
  4. Update monitoring dashboards
  5. Create JIRA for node replacement

**Incident: AZ Failure (33% of cluster)**

Detection: All nodes in AZ unreachable Action:

  1. Verify: AWS service health dashboard
  2. Failover: Promote replicas in other AZs (2 min)
  3. Load distribution: Redistribute traffic (5 min)
  4. Monitoring: Watch for cascading failures
  5. Capacity: Add temporary nodes if needed

**Incident: Cascading Failure (&gt;10% nodes)**

Detection: Circuit breaker triggered Action:

  1. Stop rebalancing immediately
  2. Enable load shedding (drop 20% lowest priority queries)
  3. Investigate root cause (memory leak? query storm?)
  4. Add capacity if needed
  5. Gradual recovery (don't overwhelm remaining nodes)

Edit 5: Add Opaque Vertex IDs (Finding 15)

Location: Lines 231-261 (Vertex ID Format section) Current: Hierarchical format cluster:proxy:partition:local_id Add Discussion:

#### Trade-off: Hierarchical vs Opaque IDs

**Current Design**: Hierarchical (encoded topology)

vertex_id: "02:0045:12:user:alice"


**Pros**:
- Fast routing (parse ID to get partition)
- No routing table lookup needed
- Deterministic placement

**Cons**:
- Rebalancing requires ID rewrite
- Clients depend on internal structure
- Cannot change topology without data migration

**Alternative**: Opaque IDs with routing table

vertex_id: "v_9f8e7d6c5b4a3918" # UUID routing_table["v_9f8e7d6c5b4a3918"] → partition_id: "02:0045:12"


**Pros**:
- Topology-independent IDs
- Free rebalancing (update routing table only)
- Clients don't depend on internal structure

**Cons**:
- Routing table lookup overhead (~1 μs)
- Routing table storage (~1 GB for 100B vertices)
- Complexity: Distributed routing table updates

**Recommendation for 100B scale**: Use **opaque IDs** with distributed routing table. The flexibility for rebalancing outweighs the 1 μs overhead.

Implementation:

type RoutingTable struct { entries map[string]string // vertex_id → partition_id cache *lru.Cache // Local LRU cache }

func (rt *RoutingTable) GetPartition(vertexID string) (string, error) { // Check local cache first if partitionID, ok := rt.cache.Get(vertexID); ok { return partitionID.(string), nil }

// Fall back to distributed table
partitionID, err := rt.lookupDistributed(vertexID)
if err != nil {
return "", err
}

// Cache result
rt.cache.Add(vertexID, partitionID)
return partitionID, nil

}


RFC-058: Multi-Level Graph Indexing

Edit 6: Add Index Tiering for Memory Reconciliation (Finding 5)

Location: Add new Section 6.5 after Section 6.4 Add:

## Section 6.5: Index Tiering Strategy

### Memory Capacity Reconciliation

**Problem**: RFC-057 allocates 30 TB total memory, but naive index storage requires:

Indexes: 16 TB (11.4% overhead from Section 4.4) Hot data: 21 TB (from RFC-059) Total: 37 TB

Available: 30 TB Deficit: 7 TB (19% over capacity)


**Solution**: Tier indexes like data (hot/warm/cold)

### Index Temperature Classification

type IndexTemperature int

const ( INDEX_HOT IndexTemperature = iota // In memory INDEX_WARM // On local SSD INDEX_COLD // On S3 )

func (im *IndexManager) ClassifyIndexTemperature( partitionID string, ) IndexTemperature { metrics := im.GetIndexMetrics(partitionID)

switch {
case metrics.QueriesPerMinute > 1000:
return INDEX_HOT // Keep in memory
case metrics.QueriesPerMinute > 10:
return INDEX_WARM // Move to SSD
default:
return INDEX_COLD // Move to S3
}

}


### Memory Allocation (Revised)

memory_budget: total_available: 30 TB

allocation: hot_indexes: 1.6 TB # 10% partitions × 16 GB = 1.6 TB hot_data: 15 TB # 7% of 210 TB (reduced from 10%) warm_cache: 5 TB # In-memory cache for warm tier query_buffers: 3 TB # Query execution memory os_overhead: 2 TB # OS, JVM, buffers reserved: 3.4 TB # Headroom for spikes

total_used: 26.6 TB utilization: 89% headroom: 11%


### Index Storage Locations

| Index Tier | Storage | Access Latency | Use Case |
|------------|---------|----------------|----------|
| **Hot** | RAM | 50 μs | Frequently queried partitions (&gt;1000 qpm) |
| **Warm** | SSD | 100 μs | Occasionally queried (10-1000 qpm) |
| **Cold** | S3 | 50-100 ms | Rarely queried (&lt;10 qpm) |

### Index Promotion/Demotion

func (im *IndexManager) ManageIndexTemperature() { ticker := time.NewTicker(60 * time.Second)

for range ticker.C {
for _, partition := range im.GetAllPartitions() {
currentTemp := partition.IndexTemperature
targetTemp := im.ClassifyIndexTemperature(partition.ID)

if currentTemp != targetTemp {
im.TransitionIndex(partition.ID, targetTemp)
}
}
}

}

func (im *IndexManager) TransitionIndex( partitionID string, targetTemp IndexTemperature, ) error { partition := im.GetPartition(partitionID)

switch targetTemp {
case INDEX_HOT:
// Load index from SSD/S3 to memory
return im.LoadIndexToMemory(partitionID)

case INDEX_WARM:
// Move index from memory to SSD
if partition.IndexTemperature == INDEX_HOT {
return im.EvictIndexToSSD(partitionID)
}
// Or load from S3 to SSD
return im.LoadIndexToSSD(partitionID)

case INDEX_COLD:
// Evict index to S3
return im.EvictIndexToS3(partitionID)
}

return nil

}


### Performance Trade-offs

| Query Type | Hot Index | Warm Index (SSD) | Cold Index (S3) |
|------------|-----------|------------------|-----------------|
| **Property lookup** | 50 μs | 150 μs (3× slower) | 100 ms (2000× slower) |
| **Range scan** | 2 ms | 10 ms (5× slower) | 2 s (1000× slower) |

**Key Insight**: At 100B scale, treating indexes as tiered resources (like data) is **essential** for memory efficiency.

Edit 7: Add Index Versioning (Finding 13)

Location: Line 175 (PartitionIndex protobuf) Current:

message PartitionIndex {
string partition_id = 1;
int64 vertex_count = 2;
int64 edge_count = 3;
int64 last_updated = 4;
...
}

Change To:

message PartitionIndex {
string partition_id = 1;
int32 schema_version = 2; // NEW: Track index format version
int64 vertex_count = 3;
int64 edge_count = 4;
int64 last_updated = 5;
...
}

// Version history:
// v1: Hash + Range indexes only
// v2: + Inverted indexes
// v3: + Edge indexes
// v4: + Bloom filters
// v5: + Super-node optimizations

Add Migration Strategy:

func (il *IndexLoader) LoadIndex(partitionID string) (*PartitionIndex, error) {
data, err := il.ReadIndexFile(partitionID)
if err != nil {
return nil, err
}

index := &PartitionIndex{}
proto.Unmarshal(data, index)

// Check version compatibility
switch index.SchemaVersion {
case 4:
// Current version, use directly
return index, nil

case 3:
// Older version, missing bloom filters
// Upgrade on-the-fly
return il.UpgradeFromV3ToV4(index)

case 2:
// Very old, missing edge indexes
return il.UpgradeFromV2ToV4(index)

default:
return nil, fmt.Errorf("unsupported index version: %d",
index.SchemaVersion)
}
}

RFC-059: Hot/Cold Storage Tiers

Edit 8: Add S3 Request Cost Optimization (Finding 1)

Location: After Section 3 (Cost Analysis) Add New Section 8.5:

## Section 8.5: S3 Request Cost Optimization

### The Hidden Cost of S3

**Problem**: Initial cost analysis (Section 3) only considered storage costs:

Storage: 189 TB × $23/TB = $4,347/month


**Reality**: S3 request costs dominate at scale:

Query load: 1B queries/sec (from RFC-060) Cache miss rate: 10% × 90% cold = 9% S3 GETs/sec: 90M Cost: 90M × ($0.0004/1000) = $36/second = $93.3M/month

Total: $93.3M storage + requests (159× higher than stated)


### Multi-Tier Caching Architecture

**Tier 0: Proxy-Local Cache** (New)

varnish_cache: size_per_proxy: 100 GB (local SSD) ttl: 3600s (1 hour) eviction: LRU

effect: cache_hit_rate: 95% (up from 90%) s3_requests: 50M/sec (down from 90M)


**Tier 1: CloudFront CDN**

cloudfront_config: price_class: PriceClass_100 (US/Europe/Asia) request_cost: $0.0075 per 10K requests (85× cheaper than S3)

effect: cache_hit_rate: 98% at CDN level s3_requests: 1M/sec (down from 50M) monthly_cost: $2.6M CDN + $1.04M S3 = $3.64M


**Tier 2: S3 Express One Zone** (for hot partitions)

s3_express_config: storage_cost: $0.16/GB-month (vs $0.023 Standard) latency: 5-10ms (vs 50-100ms Standard) use_case: Top 10% hottest partitions (21 TB)

trade_off: storage_cost: 21 TB × $160/TB = $3,360/month (7× higher) benefit: 10× lower latency for hot queries


**Tier 3: Batch S3 Reads**

// Instead of: 1 S3 GET per vertex (expensive) // Do: Batch reads in 100 MB chunks (amortize cost)

func (pm *PartitionManager) BatchLoadFromS3( vertexIDs []string, ) ([]*Vertex, error) { // Group by S3 object (partitions are 100 MB files) objectGroups := pm.GroupByS3Object(vertexIDs)

for object, ids := range objectGroups {
// Single S3 GET for entire 100 MB object
data := pm.S3Client.GetObject(object)

// Extract only needed vertices
vertices := pm.ExtractVertices(data, ids)

// Cache entire object for subsequent reads
pm.CacheObject(object, data)
}

// Cost: 1 S3 GET for 1000 vertices (vs 1000 GETs)
// Savings: 1000× reduction in request cost

}


### Revised Cost Model

| Component | Monthly Cost | Notes |
|-----------|--------------|-------|
| Hot Tier RAM | $583,000 | 1000 proxies |
| Cold Tier Storage | $4,347 | 189 TB S3 |
| **Proxy-Local Cache** | $0 | Included in proxy cost |
| **CloudFront CDN** | $2,600,000 | 98% cache hit |
| **S3 GET Requests** | $1,040,000 | 2% miss after CDN |
| Cross-AZ Bandwidth | $500,000 | From RFC-057 edits |
| Operational | $150,000 | Monitoring, logging |
| **Total** | **$4,877,347/month** | **$58.5M/year** |

**Key Insight**: True cost is **8× higher** than initial estimate, but still **44% cheaper** than pure in-memory baseline ($105M).

### Cost Optimization Roadmap

| Scale | Without CDN | With CDN | Savings |
|-------|-------------|----------|---------|
| 1B vertices | $58k/month | $58k/month | 0% (CDN not needed) |
| 10B vertices | $15M/month | $1.2M/month | 92% |
| 100B vertices | $150M/month | $4.9M/month | 97% |

**Recommendation**: Deploy CloudFront CDN **from day one** at 10B scale. Architecture doesn't work economically at 100B without it.

Edit 9: Add Snapshot Loading with WAL Replay (Finding 12)

Location: Add new Section 9.3 after Section 9.2 Add:

## Section 9.3: Snapshot Loading with WAL Replay

### The Version Skew Problem

Loading 210 TB snapshot takes 17 minutes. **What happens to writes during load?**

**Naive Approach** (broken):

T0: Start loading snapshot-2025-11-15 T5: New writes arrive → Where do they go? T17: Load complete → 17 minutes of writes lost!


### Solution: Dual-Version Loading

func (sl *SnapshotLoader) LoadWithWALReplay(snapshotPath string) error { // Step 1: Create shadow graph shadowGraph := sl.CreateShadowGraph()

// Step 2: Load snapshot into shadow (17 min)
snapshotTimestamp := sl.GetSnapshotTimestamp(snapshotPath)
sl.LoadSnapshot(snapshotPath, shadowGraph)

// Step 3: Replay WAL entries since snapshot
walEntries := sl.GetWALSince(snapshotTimestamp)

log.Infof("Replaying %d WAL entries from last 17 minutes",
len(walEntries))

for entry := range walEntries {
shadowGraph.Apply(entry)
}

// Step 4: Atomic switch
sl.mu.Lock()
oldGraph := sl.activeGraph
sl.activeGraph = shadowGraph
sl.mu.Unlock()

// Step 5: Cleanup old graph
oldGraph.Shutdown()

log.Infof("Snapshot load complete, switched to new graph")
return nil

}


### Query Behavior During Load

func (qe *QueryExecutor) ExecuteQuery(query *Query) (*Result, error) { // Queries always go to active graph qe.mu.RLock() graph := qe.activeGraph qe.mu.RUnlock()

return graph.Execute(query)

}


**Timeline**:

T0: Start load (old graph serves queries) T17: Load complete + WAL replay (old graph still serving) T17.001: Atomic switch (new graph serves queries) T17.1: Cleanup old graph

Downtime: 0 seconds (seamless transition)


### WAL Replay Performance

Scenario: 17 minutes of writes Write rate: 100k writes/sec Total writes: 17 × 60 × 100k = 102M writes Replay time: 102M / 1M per sec = 102 seconds

Total load time: 17 min (snapshot) + 1.7 min (WAL) = 18.7 minutes


### Consistency Guarantees

Before switch: All queries see old graph (consistent) After switch: All queries see new graph (consistent) No queries see partial state (atomic switch)

Writes during load:

  • Go to old graph (via WAL)
  • Replayed on new graph
  • No writes lost

Edit 10: Add Hysteresis for Temperature Management (Finding 8)

Location: Lines 273-289 (temperature rules) Current:

temperature_rules:
hot: ">= 1000 requests/min"
warm: ">= 10 requests/min"
cold: "< 10 requests/min"

Change To:

temperature_rules:
hot:
promote_threshold: 1200 # Higher to promote (20% hysteresis)
demote_threshold: 800 # Lower to demote
cooldown_period: 300s # Wait 5 min before demotion

warm:
promote_threshold: 1000 # From cold to warm
demote_threshold: 5 # From warm to cold
cooldown_period: 600s # Wait 10 min

cold:
# No thresholds (bottom tier)

rationale: |
Hysteresis prevents thrashing for partitions near thresholds.
Example: Partition at 990-1010 req/min won't flip every minute.
Cooldown period ensures sustained temperature change before action.

example:
partition_requests: [990, 1010, 980, 1020, 1150, 1180, 1200, 1210]

without_hysteresis:
state_changes: [warm, hot, warm, hot, hot, hot, hot, hot]
promotions: 4 (expensive!)

with_hysteresis:
state_changes: [warm, warm, warm, warm, warm, warm, hot, hot]
promotions: 1 (correct behavior)

RFC-060: Distributed Gremlin Execution

Edit 11: Add Query Resource Limits (Finding 4)

Location: Add new Section 7 after Section 6 Add:

## Section 7: Query Resource Limits and Runaway Prevention

At 1000 nodes, a single bad query can exhaust cluster resources. Multi-layer limits are critical.

### 7.1 Query Configuration Limits

query_resource_limits:

Timing limits

default_timeout_seconds: 30 max_timeout_seconds: 300 warning_threshold_seconds: 10

Memory limits

max_memory_per_query: 1_073_741_824 # 1 GB max_intermediate_results: 10_000_000 # 10M vertices

Result limits

max_result_vertices: 1_000_000 # Hard cap max_fanout_per_hop: 10_000 # Max edges per vertex max_traversal_depth: 5 # Max hops

Parallelism limits

max_partitions_per_query: 1000 max_concurrent_rpcs: 100

Cost limits (multi-tenant)

max_cost_units: 1000 cost_per_vertex_scan: 1 cost_per_edge_traversal: 2 cost_per_partition_access: 10


### 7.2 Pre-Execution Complexity Analysis

func (qca *QueryComplexityAnalyzer) AnalyzeBeforeExecution( query *GremlinQuery, ) (*ComplexityEstimate, error) { estimate := &ComplexityEstimate{}

// Estimate vertices scanned
estimate.VerticesScanned = qca.EstimateVertexScan(query)
if estimate.VerticesScanned > qca.limits.MaxIntermediateResults {
return nil, ErrQueryTooComplex{
Reason: fmt.Sprintf("Estimated %d vertices exceeds limit",
estimate.VerticesScanned),
Suggestion: "Add more selective filters before traversal",
}
}

// Estimate traversal depth
estimate.MaxDepth = qca.EstimateTraversalDepth(query)
if estimate.MaxDepth > qca.limits.MaxTraversalDepth {
return nil, ErrQueryTooComplex{
Reason: fmt.Sprintf("Traversal depth %d exceeds limit",
estimate.MaxDepth),
Suggestion: "Reduce number of .out()/.in() steps",
}
}

// Calculate cost units
estimate.CostUnits = qca.CalculateCost(estimate)
if estimate.CostUnits > qca.limits.MaxCostUnits {
return nil, ErrQueryTooExpensive{
Cost: estimate.CostUnits,
Limit: qca.limits.MaxCostUnits,
}
}

return estimate, nil

}


### 7.3 Runtime Enforcement

func (qe *QueryExecutor) ExecuteWithLimits( ctx context.Context, query *GremlinQuery, limits *QueryResourceLimits, ) (*ResultStream, error) { // Step 1: Pre-execution check estimate, err := qe.complexityAnalyzer.AnalyzeBeforeExecution(query) if err != nil { return nil, err }

// Step 2: Timeout context
timeoutCtx, cancel := context.WithTimeout(ctx,
time.Duration(limits.DefaultTimeoutSeconds) * time.Second)
defer cancel()

// Step 3: Memory tracking
memTracker := NewMemoryTracker(limits.MaxMemoryPerQuery)
defer memTracker.Release()

// Step 4: Execute with checks
resultStream := NewResultStream()

go func() {
defer resultStream.Close()

vertexCount := 0

for stage := range qe.ExecuteStages(timeoutCtx, query) {
// Check timeout
select {
case <-timeoutCtx.Done():
resultStream.SendError(ErrQueryTimeout{
Duration: time.Since(query.StartTime),
})
return
default:
}

// Check memory
if memTracker.CurrentUsage() > limits.MaxMemoryPerQuery {
resultStream.SendError(ErrMemoryLimitExceeded{
Usage: memTracker.CurrentUsage(),
})
return
}

// Check vertex count
vertexCount += len(stage.Results)
if vertexCount > limits.MaxResultVertices {
resultStream.SendWarning(WarnResultTruncated{
Returned: limits.MaxResultVertices,
Actual: vertexCount,
})
return
}

// Stream results
for _, vertex := range stage.Results {
resultStream.Send(vertex)
}
}
}()

return resultStream, nil

}


### 7.4 Circuit Breaker

type CircuitBreaker struct { failureThreshold float64 // 0.1 = 10% failure rate recoveryTimeout time.Duration // 30 seconds state CircuitState }

func (cb *CircuitBreaker) RecordFailure() { cb.mu.Lock() defer cb.mu.Unlock()

cb.failures++

total := cb.failures + cb.successes
if total > 100 {
failureRate := float64(cb.failures) / float64(total)

if failureRate > cb.failureThreshold {
cb.state = CircuitOpen
cb.lastStateChange = time.Now()

log.Errorf("Circuit breaker opened: %.1f%% failure rate",
failureRate*100)
}
}

}


### 7.5 Query Admission Control

type QueryAdmissionController struct { maxConcurrentQueries int currentQueries int64 priorityLevels map[string]int }

func (qac *QueryAdmissionController) AdmitQuery( query *Query, principal *Principal, ) error { current := atomic.LoadInt64(&qac.currentQueries)

if current >= int64(qac.maxConcurrentQueries) {
priority := qac.priorityLevels[principal.TenantID]

if priority < 5 {
return ErrClusterOverloaded{
CurrentLoad: current,
RetryAfter: 5 * time.Second,
}
}
}

atomic.AddInt64(&qac.currentQueries, 1)
defer atomic.AddInt64(&qac.currentQueries, -1)

return nil

}

Edit 12: Add Super-Node Handling (Finding 2)

Location: Add new Section 6 after Section 5 Add:

## Section 6: Power-Law Graph Distribution and Super-Node Handling

Real-world graphs follow power-law distribution. Top 0.01% vertices (celebrities, hub accounts) have 1M+ edges.

### 6.1 Super-Node Classification

type VertexType int

const ( VERTEX_TYPE_NORMAL VertexType = iota // <10k edges (99%) VERTEX_TYPE_HUB_NODE // 10k-100k edges (1%) VERTEX_TYPE_SUPER_NODE // 100k-1M edges (0.01%) VERTEX_TYPE_MEGA_NODE // >1M edges (0.001%) )

func (pm *PartitionManager) ClassifyVertex(vertexID string) VertexType { degree := pm.GetVertexDegree(vertexID)

switch {
case degree > 1_000_000:
return VERTEX_TYPE_MEGA_NODE
case degree > 100_000:
return VERTEX_TYPE_SUPER_NODE
case degree > 10_000:
return VERTEX_TYPE_HUB_NODE
default:
return VERTEX_TYPE_NORMAL
}

}


### 6.2 Sampling Strategy

func (qe *QueryExecutor) TraverseWithSampling( vertexID string, edgeLabel string, maxResults int, ) ([]*Vertex, *SamplingMetadata, error) { degree := qe.GetVertexDegree(vertexID)

if degree <= 10_000 {
// Normal: Return all edges
edges := qe.GetAllEdges(vertexID, edgeLabel)
return edges, &SamplingMetadata{Sampled: false}, nil
}

if degree > 1_000_000 {
// Mega-node: Use HyperLogLog + random sample
cardinality := qe.EstimateCardinality(vertexID, edgeLabel)
sample := qe.RandomSample(vertexID, edgeLabel, maxResults)

return sample, &SamplingMetadata{
Sampled: true,
ActualDegree: cardinality,
Method: "random_sample",
}, nil
}

// Hub node: Return top-K by weight
topK := qe.GetTopKEdges(vertexID, edgeLabel, maxResults)

return topK, &SamplingMetadata{
Sampled: true,
ActualDegree: degree,
Method: "top_k",
}, nil

}


### 6.3 Gremlin Extensions for Approximation

// New: .approximate() step g.V('user:taylor_swift') .in('FOLLOWS') .approximate() // Enable sampling/approximation .count()

// Returns: ~100M (HyperLogLog estimate)

// New: .sample(N) step g.V('user:taylor_swift') .in('FOLLOWS') .sample(10000) // Random sample of 10k followers .has('country', 'USA') .count()

// Returns: 6,543 USA followers in sample // Extrapolate: ~65M total USA followers


### 6.4 Circuit Breaker for Super-Nodes

func (qe *QueryExecutor) ExecuteWithCircuitBreaker( query *Query, ) (*Result, error) { // Check vertex degree before execution startVertex := qe.ExtractStartVertex(query) degree := qe.GetVertexDegree(startVertex)

if degree > 1_000_000 {
// Require explicit .approximate() or .sample() step
if !query.HasApproximateMode() {
return nil, ErrSuperNodeQuery{
VertexID: startVertex,
Degree: degree,
Suggestion: "Use .approximate() or .sample(N)",
}
}
}

return qe.Execute(query)

}

Edit 13: Add Query Observability (Finding 11)

Location: Add new Section 10 after Section 9 Add:

## Section 10: Query Observability and Debugging

### 10.1 EXPLAIN Plan (SQL-style)

g.V().has('city', 'SF').out('FOLLOWS').values('name').explain()

// Output: Step 1: IndexScan (city='SF') Partitions: 150 of 16,000 (99.1% pruned) Estimated vertices: 5M Cost: 2s (parallel scan)

Step 2: EdgeTraversal (out:FOLLOWS) Fan-out: 50M edges Partitions: 5,000 (crossing boundaries) Cost: 10s (network RPC)

Step 3: PropertyExtraction (values:name) Memory tier: 60% (warm) S3 fetches: 2M (cold tier) Cost: 5s (S3 latency)

Total estimated cost: 17s


### 10.2 Query Timeline Visualization

[==IndexScan:2.1s==][====EdgeTraversal:12.3s====][=PropExtract:4.8s=] ^ Slow step (S3 fetches)


### 10.3 Distributed Tracing (OpenTelemetry)

func (qe *QueryExecutor) ExecuteWithTracing( ctx context.Context, query *GremlinQuery, ) (*Result, error) { // Create trace span tracer := otel.Tracer("query-executor") ctx, span := tracer.Start(ctx, "execute_gremlin_query") defer span.End()

// Add query attributes
span.SetAttributes(
attribute.String("query", query.Text),
attribute.Int64("estimated_vertices", query.EstimatedVertices),
)

// Execute stages with child spans
for _, stage := range query.Stages {
stageCtx, stageSpan := tracer.Start(ctx,
fmt.Sprintf("stage_%d", stage.Index))

result := qe.ExecuteStage(stageCtx, stage)

stageSpan.SetAttributes(
attribute.Int64("vertices_processed", result.Count),
attribute.Int64("partitions_queried", len(stage.Partitions)),
)
stageSpan.End()
}

return result, nil

}


### 10.4 Slow Query Log

slow_query_log: enabled: true threshold_seconds: 10 destination: elasticsearch

fields: - query_text - query_duration_ms - vertices_scanned - partitions_queried - principal_id - timestamp

example_entry: query: "g.V().out().out().out().count()" duration_ms: 45230 vertices_scanned: 125000000 partitions_queried: 8432 principal: "user@company.com" timestamp: "2025-11-15T10:30:00Z" diagnosis: "Too many hops without filters"


### 10.5 Metrics and Alerts

prometheus_metrics:

  • name: query_latency_seconds type: histogram labels: [query_type, complexity, tenant_id] buckets: [0.001, 0.01, 0.1, 1, 10, 30, 60, 300]

  • name: query_timeouts_total type: counter labels: [query_type, reason, tenant_id]

  • name: circuit_breaker_state type: gauge labels: [cluster_id]

alerts:

  • name: HighQueryTimeoutRate condition: rate(query_timeouts_total[5m]) > 0.1 severity: warning action: "Investigate slow queries"

  • name: CircuitBreakerOpen condition: circuit_breaker_state == 1 severity: critical action: "Cluster instability detected"


RFC-061: Graph Authorization

Edit 14: Add Batch Authorization Optimization (Finding 10)

Location: Add new Section 7.5 after Section 7.4 Add:

## Section 7.5: Batch Authorization Optimization

### The Performance Problem

Per-vertex authorization check (10 μs) × 1M vertex query = **10 seconds overhead** (unacceptable).

// Naive approach (slow): for _, vertex := range results { if !authz.CanAccessVertex(principal, vertex) { continue // Skip unauthorized } authorizedResults = append(authorizedResults, vertex) }

// Cost: 1M vertices × 10 μs = 10 seconds


### Solution: Bitmap-Based Batch Authorization

type ClearanceBitmapCache struct { // principal_id → partition_id → authorized (bitmap) cache map[string]*roaring.Bitmap }

func (cbc *ClearanceBitmapCache) GetAuthorizedPartitions( principalID string, ) *roaring.Bitmap { // Check cache if bitmap, exists := cbc.cache[principalID]; exists { return bitmap }

// Compute bitmap
principal := cbc.LoadPrincipal(principalID)
bitmap := roaring.New()

for _, partition := range cbc.GetAllPartitions() {
if cbc.IsPartitionAuthorized(partition, principal) {
bitmap.Add(uint32(partition.ID))
}
}

// Cache result
cbc.cache[principalID] = bitmap
return bitmap

}

func (qe *QueryExecutor) ExecuteWithBatchAuthz( principal *Principal, partitions []string, ) ([]*Vertex, error) { // Get authorized partitions bitmap (computed once) authorizedPartitions := qe.bitmapCache.GetAuthorizedPartitions( principal.ID)

// Filter partitions (fast bitmap check)
filteredPartitions := []string{}
for _, partitionID := range partitions {
if authorizedPartitions.Contains(uint32(partitionID)) {
filteredPartitions = append(filteredPartitions, partitionID)
}
}

// Query only authorized partitions
return qe.QueryPartitions(filteredPartitions)

}

// Cost: 1 bitmap computation + 1M bit tests // Time: 100 μs (compute) + 1M × 1 ns (test) = 1.1 ms // Speedup: 10s → 1ms = 10,000× faster!


### Partition-Level Authorization Filter

func (qp *QueryPlanner) PrunePartitionsWithAuthz( filter *Filter, principal *Principal, ) []string { // Step 1: Index-based pruning candidatePartitions := qp.PrunePartitionsWithIndex(filter)

// Step 2: Authorization pruning (bitmap)
authorizedBitmap := qp.bitmapCache.GetAuthorizedPartitions(
principal.ID)

// Step 3: Intersect
authorizedPartitions := []string{}
for _, partitionID := range candidatePartitions {
if authorizedBitmap.Contains(uint32(partitionID)) {
authorizedPartitions = append(authorizedPartitions, partitionID)
}
}

return authorizedPartitions

}


### Performance Comparison

| Query Size | Per-Vertex Authz | Bitmap Authz | Speedup |
|------------|------------------|--------------|---------|
| **1k vertices** | 10 ms | 1 ms | 10× |
| **10k vertices** | 100 ms | 1 ms | 100× |
| **100k vertices** | 1 s | 1.1 ms | 909× |
| **1M vertices** | 10 s | 1.1 ms | 9,090× |

### Cache Invalidation

func (cbc *ClearanceBitmapCache) InvalidateOnPrincipalUpdate( principalID string, ) { // Clear cached bitmap when principal clearances change delete(cbc.cache, principalID) }

func (cbc *ClearanceBitmapCache) InvalidateOnPartitionLabelChange( partitionID string, ) { // Clear all cached bitmaps (expensive but rare) cbc.cache = make(map[string]*roaring.Bitmap) }


**Key Insight**: At 100B scale, per-vertex authorization checks are infeasible. Bitmap-based batch authorization is **mandatory** for acceptable performance.

Edit 15: Add Audit Log Sampling (Finding 14)

Location: Lines 863-870 (Audit Log Throughput) Add After Table:

### Audit Log Sampling Strategy

**Problem**: 388 TB for 90 days of audit logs is excessive (and expensive).

**Solution**: Sample non-sensitive queries, always log sensitive/denied:

audit_sampling_config:

Always log (100%)

always_log: - event_type: access_denied - event_type: sensitive_access - resource_labels: [pii, financial, secret] - principal_types: [admin, root, service_account]

Sample normal queries (1%)

sample: rate: 0.01 # 1% of normal queries method: consistent_hash # Same query_id always sampled/not

Enhanced logging for sampled queries

include_query_plan: true include_timing_breakdown: true


**Implementation**:

func (al *AuditLogger) ShouldLog(event *AuditEvent) bool { // Always log denials if event.Decision == ACCESS_DECISION_DENY { return true }

// Always log sensitive access
if al.IsSensitiveAccess(event) {
return true
}

// Always log privileged principals
if al.IsPrivilegedPrincipal(event.PrincipalID) {
return true
}

// Sample normal queries (1%)
hash := xxhash.Sum64String(event.QueryID)
return (hash % 100) < 1 // 1% sample rate

}


**Cost Savings**:

Original: 388 TB for 90 days With 1% sampling: 3.88 TB (normal) + 10 TB (sensitive/denied) = 13.88 TB Savings: 96% reduction

Storage cost: 13.88 TB × $23/TB = $319/month (vs $8,924 original)


**Trade-offs**:
- **Pro**: 96% cost savings, focus on important events
- **Con**: Can't reconstruct all query history (only sample)
- **Acceptable**: Compliance requirements met (all sensitive/denied logged)

Summary

This document defines 15 specific edits across 5 RFCs to address the 18 findings from MEMO-050:

RFCEditsCritical FixesHigh PriorityMedium Priority
RFC-0575 editsNetwork topology, Opaque IDsPartition sizing, Hash function, Failure recovery-
RFC-0582 editsMemory reconciliation-Index versioning
RFC-0593 editsS3 cost optimization-WAL replay, Hysteresis
RFC-0603 editsQuery limits, Super-node handling-Observability
RFC-0612 edits-Batch authorizationAudit sampling

Estimated Effort: 15-20 engineer-days to implement all edits with proper testing and validation.

Priority Order:

  1. Week 1: Critical P0 edits (cost model, memory, limits)
  2. Week 2: High priority P1 edits (performance, reliability)
  3. Week 3: Medium priority P2 edits (operations, polish)

Next Steps:

  1. Review this summary with architecture team
  2. Assign edits to RFC owners
  3. Schedule RFC update sprint
  4. Update RFC status to "Draft - Revisions In Progress"
  5. Final review after all edits complete