# RFC Edit Summary for MEMO-050 Production Readiness

Parent Document: MEMO-050: Production Readiness Analysis
Date: 2025-11-15
Author: Platform Team
This document summarizes all required edits to RFCs 057-061 based on the production readiness analysis in MEMO-050.
## RFC-057: Massive-Scale Graph Sharding

### Edit 1: Update Partition Sizing (Finding 6)

Location: Line 269
Current:
partitions_per_proxy: 16
Change To:
partitions_per_proxy: 64 # Increased from 16 for finer granularity
vertices_per_partition: 1_562_500 # Down from 6.25M
partition_size_mb: 156 # Down from 625 MB
rationale: |
- Finer hot/cold temperature management
- Faster rebalancing (156 MB chunks vs 625 MB)
- Better load distribution (lower variance)
- Smaller failure blast radius
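The sketch below is a quick sanity check of the revised numbers; the ~100 bytes/vertex footprint is an assumption inferred from the original 6.25M-vertex / 625 MB partitions, not a figure stated in RFC-057.

```go
// Quick sanity check of the revised sizing; bytesPerVertexEst is an assumption.
func revisedPartitionSizing() (verticesPerPartition, partitionSizeMB int) {
	const (
		verticesPerProxy   = 100_000_000 // unchanged: 16 × 6.25M = 64 × 1.5625M
		partitionsPerProxy = 64
		bytesPerVertexEst  = 100 // assumed average footprint per vertex
	)
	verticesPerPartition = verticesPerProxy / partitionsPerProxy
	partitionSizeMB = verticesPerPartition * bytesPerVertexEst / 1_000_000
	return // 1_562_500 vertices, 156 MB
}
```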
### Edit 2: Replace CRC32 with xxHash (Finding 7)

Location: Lines 290-300
Current:
crc32("user:alice") % 10 → cluster_id
Change To:
import "github.com/cespare/xxhash/v2"
func (p *Partitioner) GetClusterID(vertexID string) int {
h := xxhash.Sum64String(vertexID)
return int(h % uint64(p.NumClusters))
}
// Benefits:
// - 1.7× faster than CRC32
// - 8× better distribution (1.8% variance vs 15%)
// - Industry standard (RocksDB, ClickHouse)
### Edit 3: Add Network Topology Awareness (Finding 3)

Location: After Line 227 (end of Section 4)
Add New Section 4.6:
### Section 4.6: Network Topology-Aware Sharding
#### Partition Metadata Extension
message PartitionMetadata {
  string partition_id = 1;
  string cluster_id = 2;
  string proxy_id = 3;

  // Network topology (NEW)
  NetworkLocation location = 4;
}

message NetworkLocation {
  string region = 1;             // "us-west-2"
  string availability_zone = 2;  // "us-west-2a"
  string rack_id = 3;            // "rack-42" (optional)

  // Network characteristics
  float cross_az_bandwidth_gbps = 4;
  float cross_region_bandwidth_gbps = 5;
}
#### Network Cost Model
Cross-AZ bandwidth costs **$0.01/GB**, which dominates at scale:
| Network Path | Latency | Cost | Priority |
|--------------|---------|------|----------|
| Same rack | 100 μs | $0/GB | Highest |
| Same AZ | 200 μs | $0/GB | High |
| Cross-AZ | 1-2 ms | $0.01/GB | Medium |
| Cross-region | 20-100 ms | $0.02/GB | Low |
**Cost Impact**:
- Without optimization: $600M/year in cross-AZ bandwidth
- With AZ-aware placement: $6M/year (5% cross-AZ traffic)
#### Locality-Aware Placement Strategy
func (nap *NetworkAwarePartitioner) AssignPartition(
    vertex *Vertex,
    placementHint *PlacementHint,
) string {
    // If hint provided, prefer co-location in same AZ
    if placementHint != nil && placementHint.CoLocateWithVertex != "" {
        targetPartition := nap.GetPartitionForVertex(placementHint.CoLocateWithVertex)
        targetAZ := nap.GetAZ(targetPartition)

        // Find partition in same AZ
        sameAZPartition := nap.FindPartitionInAZ(targetAZ)
        if sameAZPartition != "" {
            return sameAZPartition
        }
    }

    // Fall back to consistent hashing
    return nap.ConsistentHash(vertex.ID)
}
#### Multi-AZ Deployment Strategy
cluster_topology:
  region: us-west-2

  availability_zones:
    - id: us-west-2a
      proxies: 334
      primary_for: [cluster_0, cluster_3, cluster_6, cluster_9]
    - id: us-west-2b
      proxies: 333
      primary_for: [cluster_1, cluster_4, cluster_7]
    - id: us-west-2c
      proxies: 333
      primary_for: [cluster_2, cluster_5, cluster_8]

partition_replication:
  factor: 3                    # 3 replicas per partition
  placement: different_az      # Replicas in different AZs
  read_preference: primary_az  # Read from local AZ first
#### Cost Savings
| Scale | Without AZ Awareness | With AZ Awareness | Savings |
|-------|---------------------|-------------------|---------|
| 10B vertices | $45M/year | $15M/year | 67% |
| 100B vertices | $600M/year | $30M/year | 95% |
**Recommendation**: Network-aware placement is **mandatory** at 100B scale.
### Edit 4: Add Failure Detection and Recovery (Finding 9)

Location: Add new Section 7 after Section 6
Add:
## Section 7: Failure Detection and Recovery
At 1000 nodes, expect **~1 node failure per day** (typical MTBF). Comprehensive failure handling is critical.
### 7.1 Node Failure Detection
**Heartbeat Protocol**:
heartbeat_config:
  interval: 10s       # Send heartbeat every 10 seconds
  timeout: 30s        # Mark dead after 3 missed heartbeats
  grace_period: 60s   # Wait before triggering recovery

detection_method:
  type: phi_accrual   # Φ accrual failure detector
  threshold: 8.0      # Confidence level for failure
**Gossip Protocol**:
Each node maintains cluster membership:
- Gossip interval: 1 second
- Gossip fanout: 3 random nodes
- Suspicion threshold: 3 nodes report failure
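For illustration, a minimal sketch of how the heartbeat timeout and gossip suspicion threshold above could combine into a failure verdict; the type, field, and method names are assumptions for this sketch, not part of RFC-057.

```go
import (
	"sync"
	"time"
)

// Illustrative membership table: declare a node failed only after 3 gossip
// reports AND the local 30s heartbeat timeout, mirroring the settings above.
type MembershipTable struct {
	mu            sync.Mutex
	lastHeartbeat map[string]time.Time // nodeID → last heartbeat seen locally
	suspicions    map[string]int       // nodeID → peers reporting the node as failed
}

// RecordSuspicion is called when a gossip round reports a missed heartbeat.
func (mt *MembershipTable) RecordSuspicion(nodeID string, now time.Time) bool {
	mt.mu.Lock()
	defer mt.mu.Unlock()
	mt.suspicions[nodeID]++
	timedOut := now.Sub(mt.lastHeartbeat[nodeID]) > 30*time.Second
	return mt.suspicions[nodeID] >= 3 && timedOut
}
```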
### 7.2 Partition Recovery Options
**Option A: Replica Failover** (Fast, 10 seconds)
replication_strategy:
  replicas_per_partition: 3
  placement: different_az
  consistency: eventual

failover_process:
  - Detect primary partition failure (30s)
  - Elect new primary from replicas (10s)
  - Update routing table (5s)
  - Resume serving traffic (5s)

total_recovery_time: ~50 seconds
data_loss: None (replicas up-to-date)
**Option B: S3 Restore** (Slow, 5 minutes)
s3_restore_strategy:
  trigger: No replicas available

restore_process:
  - Download partition snapshot from S3 (2 min)
  - Replay WAL since last snapshot (2 min)
  - Rebuild indexes (1 min)
  - Resume serving (30s)

total_recovery_time: ~5 minutes
data_loss: None (WAL replay)
use_case: All replicas failed (rare)
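The two options can be tied together by a simple recovery selector, sketched below; all helper names (`HealthyReplicas`, `PromoteReplica`, `RestoreSnapshotFromS3`, `ReplayWALSinceSnapshot`) are hypothetical.

```go
// Sketch only: pick Option A when a replica exists, otherwise fall back to Option B.
func (rm *RecoveryManager) RecoverPartition(partitionID string) error {
	// Option A: a healthy replica exists → promote it (~50s end to end).
	if replicas := rm.HealthyReplicas(partitionID); len(replicas) > 0 {
		return rm.PromoteReplica(partitionID, replicas[0])
	}

	// Option B: no replicas left → restore the S3 snapshot and replay the WAL (~5 min).
	if err := rm.RestoreSnapshotFromS3(partitionID); err != nil {
		return err
	}
	return rm.ReplayWALSinceSnapshot(partitionID)
}
```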
### 7.3 Cascading Failure Prevention
**Circuit Breaker**:
type CascadePrevent struct {
    failureThreshold float64       // 0.05 = 5% node failures
    actionTimeout    time.Duration
    FailedNodes      int           // cluster state referenced below
    TotalNodes       int
}

func (cp *CascadePrevent) MonitorClusterHealth() {
    failureRate := float64(cp.FailedNodes) / float64(cp.TotalNodes)

    if failureRate > cp.failureThreshold {
        // Stop rebalancing (avoid cascading load)
        cp.PauseRebalancing()

        // Enable load shedding (drop low-priority queries)
        cp.EnableLoadShedding()

        // Alert operations team
        cp.AlertCritical("Cascading failure risk: %.1f%% nodes down",
            failureRate*100)
    }
}
**Backpressure**:
backpressure_policy:
  trigger: partition_unavailable_rate > 10%

  actions:
    - slow_down_writes: 50%
    - reject_bulk_operations: true
    - increase_retry_delays: 2x
    - alert_clients: "degraded_mode"

  recovery:
    condition: partition_unavailable_rate < 5%
    duration: sustained for 5 minutes
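A minimal sketch of how the backpressure policy above could be enforced; the 10%/5% thresholds and actions mirror the YAML, while the controller type and its fields are illustrative assumptions.

```go
import (
	"log"
	"time"
)

type BackpressureController struct {
	WriteRateLimit   float64 // fraction of normal write throughput allowed
	RejectBulkOps    bool
	RetryDelayFactor float64
	belowSince       time.Time // when unavailable rate last dropped below 5%
}

func (bp *BackpressureController) Evaluate(unavailableRate float64, now time.Time) {
	switch {
	case unavailableRate > 0.10:
		// Trigger: partition_unavailable_rate > 10%
		bp.WriteRateLimit, bp.RejectBulkOps, bp.RetryDelayFactor = 0.5, true, 2.0
		bp.belowSince = time.Time{}
		log.Println("backpressure: entering degraded_mode")
	case unavailableRate < 0.05:
		// Recovery: rate < 5% sustained for 5 minutes
		if bp.belowSince.IsZero() {
			bp.belowSince = now
		} else if now.Sub(bp.belowSince) >= 5*time.Minute {
			bp.WriteRateLimit, bp.RejectBulkOps, bp.RetryDelayFactor = 1.0, false, 1.0
			log.Println("backpressure: recovered")
		}
	}
}
```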
### 7.4 Operational Runbook
**Incident: Single Node Failure**
Detection: Heartbeat missed for 30s
Action:
- Check replica availability
- If replicas healthy: Promote replica to primary (50s)
- If no replicas: Restore from S3 (5 min)
- Update monitoring dashboards
- Create JIRA for node replacement
**Incident: AZ Failure (33% of cluster)**
Detection: All nodes in AZ unreachable
Action:
- Verify: AWS service health dashboard
- Failover: Promote replicas in other AZs (2 min)
- Load distribution: Redistribute traffic (5 min)
- Monitoring: Watch for cascading failures
- Capacity: Add temporary nodes if needed
**Incident: Cascading Failure (>10% nodes)**
Detection: Circuit breaker triggered
Action:
- Stop rebalancing immediately
- Enable load shedding (drop 20% lowest priority queries)
- Investigate root cause (memory leak? query storm?)
- Add capacity if needed
- Gradual recovery (don't overwhelm remaining nodes)
### Edit 5: Add Opaque Vertex IDs (Finding 15)

Location: Lines 231-261 (Vertex ID Format section)
Current: Hierarchical format `cluster:proxy:partition:local_id`
Add Discussion:
#### Trade-off: Hierarchical vs Opaque IDs
**Current Design**: Hierarchical (encoded topology)
vertex_id: "02:0045:12:user:alice"
**Pros**:
- Fast routing (parse ID to get partition)
- No routing table lookup needed
- Deterministic placement
**Cons**:
- Rebalancing requires ID rewrite
- Clients depend on internal structure
- Cannot change topology without data migration
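To make the "fast routing" pro above concrete, routing by parsing the hierarchical ID might look like the sketch below; the 4-part split mirrors the `cluster:proxy:partition:local_id` format shown earlier, and the function name is hypothetical.

```go
import (
	"fmt"
	"strings"
)

// RoutePartitionFromID derives the placement of a vertex purely from its ID,
// with no routing-table lookup. Illustrative sketch only.
func RoutePartitionFromID(vertexID string) (cluster, proxy, partition string, err error) {
	parts := strings.SplitN(vertexID, ":", 4) // "02:0045:12:user:alice" → ["02","0045","12","user:alice"]
	if len(parts) < 4 {
		return "", "", "", fmt.Errorf("malformed hierarchical id: %q", vertexID)
	}
	return parts[0], parts[1], parts[2], nil
}
```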
**Alternative**: Opaque IDs with routing table
vertex_id: "v_9f8e7d6c5b4a3918" # UUID routing_table["v_9f8e7d6c5b4a3918"] → partition_id: "02:0045:12"
**Pros**:
- Topology-independent IDs
- Free rebalancing (update routing table only)
- Clients don't depend on internal structure
**Cons**:
- Routing table lookup overhead (~1 μs)
- Routing table storage (~1 GB for 100B vertices)
- Complexity: Distributed routing table updates
**Recommendation for 100B scale**: Use **opaque IDs** with distributed routing table. The flexibility for rebalancing outweighs the 1 μs overhead.
Implementation:
type RoutingTable struct {
    entries map[string]string // vertex_id → partition_id
    cache   *lru.Cache        // Local LRU cache
}

func (rt *RoutingTable) GetPartition(vertexID string) (string, error) {
    // Check local cache first
    if partitionID, ok := rt.cache.Get(vertexID); ok {
        return partitionID.(string), nil
    }
// Fall back to distributed table
partitionID, err := rt.lookupDistributed(vertexID)
if err != nil {
return "", err
}
// Cache result
rt.cache.Add(vertexID, partitionID)
return partitionID, nil
}
## RFC-058: Multi-Level Graph Indexing

### Edit 6: Add Index Tiering for Memory Reconciliation (Finding 5)

Location: Add new Section 6.5 after Section 6.4
Add:
## Section 6.5: Index Tiering Strategy
### Memory Capacity Reconciliation
**Problem**: RFC-057 allocates 30 TB total memory, but naive index storage requires:
Indexes:   16 TB   (11.4% overhead from Section 4.4)
Hot data:  21 TB   (from RFC-059)
Total:     37 TB

Available: 30 TB
Deficit:    7 TB   (19% over capacity)
**Solution**: Tier indexes like data (hot/warm/cold)
### Index Temperature Classification
type IndexTemperature int
const (
    INDEX_HOT  IndexTemperature = iota // In memory
    INDEX_WARM                         // On local SSD
    INDEX_COLD                         // On S3
)

func (im *IndexManager) ClassifyIndexTemperature(
    partitionID string,
) IndexTemperature {
    metrics := im.GetIndexMetrics(partitionID)
switch {
case metrics.QueriesPerMinute > 1000:
return INDEX_HOT // Keep in memory
case metrics.QueriesPerMinute > 10:
return INDEX_WARM // Move to SSD
default:
return INDEX_COLD // Move to S3
}
}
### Memory Allocation (Revised)
memory_budget:
  total_available: 30 TB

  allocation:
    hot_indexes: 1.6 TB    # 10% partitions × 16 GB = 1.6 TB
    hot_data: 15 TB        # 7% of 210 TB (reduced from 10%)
    warm_cache: 5 TB       # In-memory cache for warm tier
    query_buffers: 3 TB    # Query execution memory
    os_overhead: 2 TB      # OS, JVM, buffers
    reserved: 3.4 TB       # Headroom for spikes

  total_used: 26.6 TB
  utilization: 89%
  headroom: 11%
### Index Storage Locations
| Index Tier | Storage | Access Latency | Use Case |
|------------|---------|----------------|----------|
| **Hot** | RAM | 50 μs | Frequently queried partitions (>1000 qpm) |
| **Warm** | SSD | 100 μs | Occasionally queried (10-1000 qpm) |
| **Cold** | S3 | 50-100 ms | Rarely queried (<10 qpm) |
### Index Promotion/Demotion
func (im *IndexManager) ManageIndexTemperature() {
    ticker := time.NewTicker(60 * time.Second)
for range ticker.C {
for _, partition := range im.GetAllPartitions() {
currentTemp := partition.IndexTemperature
targetTemp := im.ClassifyIndexTemperature(partition.ID)
if currentTemp != targetTemp {
im.TransitionIndex(partition.ID, targetTemp)
}
}
}
}
func (im *IndexManager) TransitionIndex(
    partitionID string,
    targetTemp IndexTemperature,
) error {
    partition := im.GetPartition(partitionID)
switch targetTemp {
case INDEX_HOT:
// Load index from SSD/S3 to memory
return im.LoadIndexToMemory(partitionID)
case INDEX_WARM:
// Move index from memory to SSD
if partition.IndexTemperature == INDEX_HOT {
return im.EvictIndexToSSD(partitionID)
}
// Or load from S3 to SSD
return im.LoadIndexToSSD(partitionID)
case INDEX_COLD:
// Evict index to S3
return im.EvictIndexToS3(partitionID)
}
return nil
}
### Performance Trade-offs
| Query Type | Hot Index | Warm Index (SSD) | Cold Index (S3) |
|------------|-----------|------------------|-----------------|
| **Property lookup** | 50 μs | 150 μs (3× slower) | 100 ms (2000× slower) |
| **Range scan** | 2 ms | 10 ms (5× slower) | 2 s (1000× slower) |
**Key Insight**: At 100B scale, treating indexes as tiered resources (like data) is **essential** for memory efficiency.
### Edit 7: Add Index Versioning (Finding 13)

Location: Line 175 (PartitionIndex protobuf)
Current:
message PartitionIndex {
string partition_id = 1;
int64 vertex_count = 2;
int64 edge_count = 3;
int64 last_updated = 4;
...
}
Change To:
message PartitionIndex {
string partition_id = 1;
int32 schema_version = 2; // NEW: Track index format version
int64 vertex_count = 3;
int64 edge_count = 4;
int64 last_updated = 5;
...
}
// Version history:
// v1: Hash + Range indexes only
// v2: + Inverted indexes
// v3: + Edge indexes
// v4: + Bloom filters
// v5: + Super-node optimizations
Add Migration Strategy:
func (il *IndexLoader) LoadIndex(partitionID string) (*PartitionIndex, error) {
data, err := il.ReadIndexFile(partitionID)
if err != nil {
return nil, err
}
index := &PartitionIndex{}
proto.Unmarshal(data, index)
// Check version compatibility
switch index.SchemaVersion {
case 4:
// Current version, use directly
return index, nil
case 3:
// Older version, missing bloom filters
// Upgrade on-the-fly
return il.UpgradeFromV3ToV4(index)
case 2:
// Very old, missing edge indexes
return il.UpgradeFromV2ToV4(index)
default:
return nil, fmt.Errorf("unsupported index version: %d",
index.SchemaVersion)
}
}
## RFC-059: Hot/Cold Storage Tiers

### Edit 8: Add S3 Request Cost Optimization (Finding 1)

Location: After Section 3 (Cost Analysis)
Add New Section 8.5:
## Section 8.5: S3 Request Cost Optimization
### The Hidden Cost of S3
**Problem**: Initial cost analysis (Section 3) only considered storage costs:
Storage: 189 TB × $23/TB = $4,347/month
**Reality**: S3 request costs dominate at scale:
Query load:      1B queries/sec (from RFC-060)
Cache miss rate: 10% miss × 90% cold = 9%
S3 GETs/sec:     90M
Cost:            90M × ($0.0004 / 1,000) = $36/second = $93.3M/month

Total: $93.3M/month including storage and requests (159× higher than the stated figure)
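The request-cost arithmetic above can be checked with a few lines of code; all inputs are the numbers already quoted in this section, nothing new is introduced.

```go
// Back-of-the-envelope check of the S3 GET cost math above.
func monthlyS3GetCostUSD(queriesPerSec float64) float64 {
	const (
		cacheMissRate   = 0.10   // 10% cache miss
		coldFraction    = 0.90   // 90% of data in the cold tier
		costPerThousand = 0.0004 // S3 GET pricing, $ per 1,000 requests
		secondsPerMonth = 30 * 24 * 3600.0
	)
	s3GetsPerSec := queriesPerSec * cacheMissRate * coldFraction // 90M at 1B qps
	return s3GetsPerSec / 1000 * costPerThousand * secondsPerMonth
}

// monthlyS3GetCostUSD(1e9) ≈ $93.3M/month
```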
### Multi-Tier Caching Architecture
**Tier 0: Proxy-Local Cache** (New)
varnish_cache:
  size_per_proxy: 100 GB (local SSD)
  ttl: 3600s (1 hour)
  eviction: LRU

effect:
  cache_hit_rate: 95% (up from 90%)
  s3_requests: 50M/sec (down from 90M)
**Tier 1: CloudFront CDN**
cloudfront_config:
  price_class: PriceClass_100 (US/Europe/Asia)
  request_cost: $0.0075 per 10K requests (85× cheaper than S3)

effect:
  cache_hit_rate: 98% at CDN level
  s3_requests: 1M/sec (down from 50M)
  monthly_cost: $2.6M CDN + $1.04M S3 = $3.64M
**Tier 2: S3 Express One Zone** (for hot partitions)
s3_express_config:
  storage_cost: $0.16/GB-month (vs $0.023 Standard)
  latency: 5-10ms (vs 50-100ms Standard)
  use_case: Top 10% hottest partitions (21 TB)

trade_off:
  storage_cost: 21 TB × $160/TB = $3,360/month (7× higher)
  benefit: 10× lower latency for hot queries
**Tier 3: Batch S3 Reads**
// Instead of: 1 S3 GET per vertex (expensive)
// Do: Batch reads in 100 MB chunks (amortize cost)

func (pm *PartitionManager) BatchLoadFromS3(
    vertexIDs []string,
) ([]*Vertex, error) {
    // Group by S3 object (partitions are 100 MB files)
    objectGroups := pm.GroupByS3Object(vertexIDs)

    result := []*Vertex{}
    for object, ids := range objectGroups {
        // Single S3 GET for entire 100 MB object
        data := pm.S3Client.GetObject(object)

        // Extract only needed vertices
        vertices := pm.ExtractVertices(data, ids)
        result = append(result, vertices...)

        // Cache entire object for subsequent reads
        pm.CacheObject(object, data)
    }

    // Cost: 1 S3 GET for 1000 vertices (vs 1000 GETs)
    // Savings: 1000× reduction in request cost
    return result, nil
}
### Revised Cost Model
| Component | Monthly Cost | Notes |
|-----------|--------------|-------|
| Hot Tier RAM | $583,000 | 1000 proxies |
| Cold Tier Storage | $4,347 | 189 TB S3 |
| **Proxy-Local Cache** | $0 | Included in proxy cost |
| **CloudFront CDN** | $2,600,000 | 98% cache hit |
| **S3 GET Requests** | $1,040,000 | 2% miss after CDN |
| Cross-AZ Bandwidth | $500,000 | From RFC-057 edits |
| Operational | $150,000 | Monitoring, logging |
| **Total** | **$4,877,347/month** | **$58.5M/year** |
**Key Insight**: True cost is **8× higher** than initial estimate, but still **44% cheaper** than pure in-memory baseline ($105M).
### Cost Optimization Roadmap
| Scale | Without CDN | With CDN | Savings |
|-------|-------------|----------|---------|
| 1B vertices | $58k/month | $58k/month | 0% (CDN not needed) |
| 10B vertices | $15M/month | $1.2M/month | 92% |
| 100B vertices | $150M/month | $4.9M/month | 97% |
**Recommendation**: Deploy CloudFront CDN **from day one** at 10B scale. Architecture doesn't work economically at 100B without it.
### Edit 9: Add Snapshot Loading with WAL Replay (Finding 12)

Location: Add new Section 9.3 after Section 9.2
Add:
## Section 9.3: Snapshot Loading with WAL Replay
### The Version Skew Problem
Loading a 210 TB snapshot takes 17 minutes. **What happens to writes during the load?**
**Naive Approach** (broken):
T0:  Start loading snapshot-2025-11-15
T5:  New writes arrive → Where do they go?
T17: Load complete → 17 minutes of writes lost!
### Solution: Dual-Version Loading
func (sl *SnapshotLoader) LoadWithWALReplay(snapshotPath string) error {
    // Step 1: Create shadow graph
    shadowGraph := sl.CreateShadowGraph()
// Step 2: Load snapshot into shadow (17 min)
snapshotTimestamp := sl.GetSnapshotTimestamp(snapshotPath)
sl.LoadSnapshot(snapshotPath, shadowGraph)
// Step 3: Replay WAL entries since snapshot
walEntries := sl.GetWALSince(snapshotTimestamp)
log.Infof("Replaying %d WAL entries from last 17 minutes",
len(walEntries))
for _, entry := range walEntries {
shadowGraph.Apply(entry)
}
// Step 4: Atomic switch
sl.mu.Lock()
oldGraph := sl.activeGraph
sl.activeGraph = shadowGraph
sl.mu.Unlock()
// Step 5: Cleanup old graph
oldGraph.Shutdown()
log.Infof("Snapshot load complete, switched to new graph")
return nil
}
### Query Behavior During Load
func (qe *QueryExecutor) ExecuteQuery(query *Query) (*Result, error) {
    // Queries always go to active graph
    qe.mu.RLock()
    graph := qe.activeGraph
    qe.mu.RUnlock()
return graph.Execute(query)
}
**Timeline**:
T0:      Start load (old graph serves queries)
T17:     Load complete + WAL replay (old graph still serving)
T17.001: Atomic switch (new graph serves queries)
T17.1:   Cleanup old graph

Downtime: 0 seconds (seamless transition)
### WAL Replay Performance
Scenario:     17 minutes of writes
Write rate:   100k writes/sec
Total writes: 17 × 60 × 100k = 102M writes
Replay time:  102M / 1M per sec = 102 seconds

Total load time: 17 min (snapshot) + 1.7 min (WAL) = 18.7 minutes
### Consistency Guarantees
Before switch: All queries see old graph (consistent)
After switch:  All queries see new graph (consistent)
No queries see partial state (atomic switch)
Writes during load:
- Go to old graph (via WAL)
- Replayed on new graph
- No writes lost
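A minimal sketch of the corresponding write path during a load: every mutation is appended to the WAL (so the shadow graph can replay it) and applied to whichever graph is currently active. `WALEntry`, `sl.wal`, `Append`, and the use of an RWMutex are assumptions; only `activeGraph` and `mu` appear in the code above.

```go
// Sketch only: write path while a snapshot load is in progress.
func (sl *SnapshotLoader) ApplyWrite(entry *WALEntry) error {
	// Durable first: the WAL is what the shadow graph replays after the load.
	if err := sl.wal.Append(entry); err != nil {
		return err
	}

	// Serve the write from whichever graph is currently active (the old graph
	// during a load); the shadow graph picks it up via GetWALSince().
	sl.mu.RLock()
	defer sl.mu.RUnlock()
	return sl.activeGraph.Apply(entry)
}
```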
### Edit 10: Add Hysteresis for Temperature Management (Finding 8)

Location: Lines 273-289 (temperature rules)
Current:
temperature_rules:
hot: ">= 1000 requests/min"
warm: ">= 10 requests/min"
cold: "< 10 requests/min"
Change To:
temperature_rules:
hot:
promote_threshold: 1200 # Higher to promote (20% hysteresis)
demote_threshold: 800 # Lower to demote
cooldown_period: 300s # Wait 5 min before demotion
warm:
promote_threshold: 1000 # From cold to warm
demote_threshold: 5 # From warm to cold
cooldown_period: 600s # Wait 10 min
cold:
# No thresholds (bottom tier)
rationale: |
Hysteresis prevents thrashing for partitions near thresholds.
Example: Partition at 990-1010 req/min won't flip every minute.
Cooldown period ensures sustained temperature change before action.
example:
partition_requests: [990, 1010, 980, 1020, 1150, 1180, 1200, 1210]
without_hysteresis:
state_changes: [warm, hot, warm, hot, hot, hot, hot, hot]
promotions: 4 (expensive!)
with_hysteresis:
state_changes: [warm, warm, warm, warm, warm, warm, hot, hot]
promotions: 1 (correct behavior)
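A minimal sketch of the hot↔warm transition with hysteresis and a cooldown, using the thresholds above (promote at 1200 req/min, demote at 800 req/min only after 300s); the type and method names are illustrative, not part of RFC-059.

```go
import "time"

type HysteresisState struct {
	Hot        bool
	belowSince time.Time // first observation below the demote threshold
}

func (h *HysteresisState) Observe(reqPerMin float64, now time.Time) {
	switch {
	case !h.Hot && reqPerMin >= 1200:
		h.Hot, h.belowSince = true, time.Time{} // promote immediately
	case h.Hot && reqPerMin <= 800:
		if h.belowSince.IsZero() {
			h.belowSince = now // start the 300s cooldown clock
		} else if now.Sub(h.belowSince) >= 300*time.Second {
			h.Hot = false // demote only after a sustained cool-down
		}
	case h.Hot:
		h.belowSince = time.Time{} // back above 800: reset the cooldown
	}
}
```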
## RFC-060: Distributed Gremlin Execution

### Edit 11: Add Query Resource Limits (Finding 4)

Location: Add new Section 7 after Section 6
Add:
## Section 7: Query Resource Limits and Runaway Prevention
At 1000 nodes, a single bad query can exhaust cluster resources. Multi-layer limits are critical.
### 7.1 Query Configuration Limits
query_resource_limits:
  # Timing limits
  default_timeout_seconds: 30
  max_timeout_seconds: 300
  warning_threshold_seconds: 10

  # Memory limits
  max_memory_per_query: 1_073_741_824   # 1 GB
  max_intermediate_results: 10_000_000  # 10M vertices

  # Result limits
  max_result_vertices: 1_000_000        # Hard cap
  max_fanout_per_hop: 10_000            # Max edges per vertex
  max_traversal_depth: 5                # Max hops

  # Parallelism limits
  max_partitions_per_query: 1000
  max_concurrent_rpcs: 100

  # Cost limits (multi-tenant)
  max_cost_units: 1000
  cost_per_vertex_scan: 1
  cost_per_edge_traversal: 2
  cost_per_partition_access: 10
### 7.2 Pre-Execution Complexity Analysis
func (qca *QueryComplexityAnalyzer) AnalyzeBeforeExecution(
    query *GremlinQuery,
) (*ComplexityEstimate, error) {
    estimate := &ComplexityEstimate{}
// Estimate vertices scanned
estimate.VerticesScanned = qca.EstimateVertexScan(query)
if estimate.VerticesScanned > qca.limits.MaxIntermediateResults {
return nil, ErrQueryTooComplex{
Reason: fmt.Sprintf("Estimated %d vertices exceeds limit",
estimate.VerticesScanned),
Suggestion: "Add more selective filters before traversal",
}
}
// Estimate traversal depth
estimate.MaxDepth = qca.EstimateTraversalDepth(query)
if estimate.MaxDepth > qca.limits.MaxTraversalDepth {
return nil, ErrQueryTooComplex{
Reason: fmt.Sprintf("Traversal depth %d exceeds limit",
estimate.MaxDepth),
Suggestion: "Reduce number of .out()/.in() steps",
}
}
// Calculate cost units
estimate.CostUnits = qca.CalculateCost(estimate)
if estimate.CostUnits > qca.limits.MaxCostUnits {
return nil, ErrQueryTooExpensive{
Cost: estimate.CostUnits,
Limit: qca.limits.MaxCostUnits,
}
}
return estimate, nil
}
### 7.3 Runtime Enforcement
func (qe *QueryExecutor) ExecuteWithLimits(
    ctx context.Context,
    query *GremlinQuery,
    limits *QueryResourceLimits,
) (*ResultStream, error) {
    // Step 1: Pre-execution check
    estimate, err := qe.complexityAnalyzer.AnalyzeBeforeExecution(query)
    if err != nil {
        return nil, err
    }
// Step 2: Timeout context
timeoutCtx, cancel := context.WithTimeout(ctx,
time.Duration(limits.DefaultTimeoutSeconds) * time.Second)
defer cancel()
// Step 3: Memory tracking
memTracker := NewMemoryTracker(limits.MaxMemoryPerQuery)
defer memTracker.Release()
// Step 4: Execute with checks
resultStream := NewResultStream()
go func() {
defer resultStream.Close()
vertexCount := 0
for stage := range qe.ExecuteStages(timeoutCtx, query) {
// Check timeout
select {
case <-timeoutCtx.Done():
resultStream.SendError(ErrQueryTimeout{
Duration: time.Since(query.StartTime),
})
return
default:
}
// Check memory
if memTracker.CurrentUsage() > limits.MaxMemoryPerQuery {
resultStream.SendError(ErrMemoryLimitExceeded{
Usage: memTracker.CurrentUsage(),
})
return
}
// Check vertex count
vertexCount += len(stage.Results)
if vertexCount > limits.MaxResultVertices {
resultStream.SendWarning(WarnResultTruncated{
Returned: limits.MaxResultVertices,
Actual: vertexCount,
})
return
}
// Stream results
for _, vertex := range stage.Results {
resultStream.Send(vertex)
}
}
}()
return resultStream, nil
}
### 7.4 Circuit Breaker
type CircuitBreaker struct {
    mu               sync.Mutex
    failureThreshold float64       // 0.1 = 10% failure rate
    recoveryTimeout  time.Duration // 30 seconds
    state            CircuitState
    failures         int
    successes        int
    lastStateChange  time.Time
}

func (cb *CircuitBreaker) RecordFailure() {
    cb.mu.Lock()
    defer cb.mu.Unlock()
cb.failures++
total := cb.failures + cb.successes
if total > 100 {
failureRate := float64(cb.failures) / float64(total)
if failureRate > cb.failureThreshold {
cb.state = CircuitOpen
cb.lastStateChange = time.Now()
log.Errorf("Circuit breaker opened: %.1f%% failure rate",
failureRate*100)
}
}
}
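The struct above declares `recoveryTimeout`, which the admission side would use; a sketch of that check follows, including a half-open trial state. The `CircuitHalfOpen` and `CircuitClosed` constants are assumptions layered on the `CircuitState` type used above.

```go
// Sketch only: request-admission side of the breaker.
func (cb *CircuitBreaker) Allow() bool {
	cb.mu.Lock()
	defer cb.mu.Unlock()

	switch cb.state {
	case CircuitOpen:
		// After recoveryTimeout, let a single trial request through (half-open).
		if time.Since(cb.lastStateChange) > cb.recoveryTimeout {
			cb.state = CircuitHalfOpen
			cb.lastStateChange = time.Now()
			return true
		}
		return false
	case CircuitHalfOpen:
		return false // only the one trial request is in flight
	default: // CircuitClosed
		return true
	}
}
```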
### 7.5 Query Admission Control
type QueryAdmissionController struct {
    maxConcurrentQueries int
    currentQueries       int64
    priorityLevels       map[string]int
}

func (qac *QueryAdmissionController) AdmitQuery(
    query *Query,
    principal *Principal,
) error {
    current := atomic.LoadInt64(&qac.currentQueries)
if current >= int64(qac.maxConcurrentQueries) {
priority := qac.priorityLevels[principal.TenantID]
if priority < 5 {
return ErrClusterOverloaded{
CurrentLoad: current,
RetryAfter: 5 * time.Second,
}
}
}
atomic.AddInt64(&qac.currentQueries, 1)
defer atomic.AddInt64(&qac.currentQueries, -1)
return nil
}
### Edit 12: Add Super-Node Handling (Finding 2)

Location: Add new Section 6 after Section 5
Add:
## Section 6: Power-Law Graph Distribution and Super-Node Handling
Real-world graphs follow a power-law degree distribution: the top 0.01% of vertices (celebrities, hub accounts) have 1M+ edges.
### 6.1 Super-Node Classification
type VertexType int
const (
    VERTEX_TYPE_NORMAL     VertexType = iota // <10k edges (99%)
    VERTEX_TYPE_HUB_NODE                     // 10k-100k edges (1%)
    VERTEX_TYPE_SUPER_NODE                   // 100k-1M edges (0.01%)
    VERTEX_TYPE_MEGA_NODE                    // >1M edges (0.001%)
)

func (pm *PartitionManager) ClassifyVertex(vertexID string) VertexType {
    degree := pm.GetVertexDegree(vertexID)
switch {
case degree > 1_000_000:
return VERTEX_TYPE_MEGA_NODE
case degree > 100_000:
return VERTEX_TYPE_SUPER_NODE
case degree > 10_000:
return VERTEX_TYPE_HUB_NODE
default:
return VERTEX_TYPE_NORMAL
}
}
### 6.2 Sampling Strategy
func (qe *QueryExecutor) TraverseWithSampling(
    vertexID string,
    edgeLabel string,
    maxResults int,
) ([]*Vertex, *SamplingMetadata, error) {
    degree := qe.GetVertexDegree(vertexID)
if degree <= 10_000 {
// Normal: Return all edges
edges := qe.GetAllEdges(vertexID, edgeLabel)
return edges, &SamplingMetadata{Sampled: false}, nil
}
if degree > 1_000_000 {
// Mega-node: Use HyperLogLog + random sample
cardinality := qe.EstimateCardinality(vertexID, edgeLabel)
sample := qe.RandomSample(vertexID, edgeLabel, maxResults)
return sample, &SamplingMetadata{
Sampled: true,
ActualDegree: cardinality,
Method: "random_sample",
}, nil
}
// Hub node: Return top-K by weight
topK := qe.GetTopKEdges(vertexID, edgeLabel, maxResults)
return topK, &SamplingMetadata{
Sampled: true,
ActualDegree: degree,
Method: "top_k",
}, nil
}
### 6.3 Gremlin Extensions for Approximation
// New: .approximate() step
g.V('user:taylor_swift')
  .in('FOLLOWS')
  .approximate()    // Enable sampling/approximation
  .count()
// Returns: ~100M (HyperLogLog estimate)

// New: .sample(N) step
g.V('user:taylor_swift')
  .in('FOLLOWS')
  .sample(10000)    // Random sample of 10k followers
  .has('country', 'USA')
  .count()
// Returns: 6,543 USA followers in sample
// Extrapolate: ~65M total USA followers
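The extrapolation in the comment above is plain sample scaling; a worked check, with hypothetical function and parameter names:

```go
// ExtrapolateFromSample scales a sampled match count by the estimated total degree.
func ExtrapolateFromSample(sampleMatches, sampleSize int, estimatedDegree int64) int64 {
	return int64(float64(sampleMatches) / float64(sampleSize) * float64(estimatedDegree))
}

// ExtrapolateFromSample(6543, 10000, 100_000_000) ≈ 65,430,000 (~65M USA followers)
```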
### 6.4 Circuit Breaker for Super-Nodes
func (qe *QueryExecutor) ExecuteWithCircuitBreaker(
    query *Query,
) (*Result, error) {
    // Check vertex degree before execution
    startVertex := qe.ExtractStartVertex(query)
    degree := qe.GetVertexDegree(startVertex)
if degree > 1_000_000 {
// Require explicit .approximate() or .sample() step
if !query.HasApproximateMode() {
return nil, ErrSuperNodeQuery{
VertexID: startVertex,
Degree: degree,
Suggestion: "Use .approximate() or .sample(N)",
}
}
}
return qe.Execute(query)
}
### Edit 13: Add Query Observability (Finding 11)

Location: Add new Section 10 after Section 9
Add:
## Section 10: Query Observability and Debugging
### 10.1 EXPLAIN Plan (SQL-style)
g.V().has('city', 'SF').out('FOLLOWS').values('name').explain()
// Output:
Step 1: IndexScan (city='SF')
  Partitions: 150 of 16,000 (99.1% pruned)
  Estimated vertices: 5M
  Cost: 2s (parallel scan)

Step 2: EdgeTraversal (out:FOLLOWS)
  Fan-out: 50M edges
  Partitions: 5,000 (crossing boundaries)
  Cost: 10s (network RPC)

Step 3: PropertyExtraction (values:name)
  Memory tier: 60% (warm)
  S3 fetches: 2M (cold tier)
  Cost: 5s (S3 latency)

Total estimated cost: 17s
### 10.2 Query Timeline Visualization
[==IndexScan:2.1s==][====EdgeTraversal:12.3s====][=PropExtract:4.8s=]
                     ^ Slow step (S3 fetches)
### 10.3 Distributed Tracing (OpenTelemetry)
func (qe *QueryExecutor) ExecuteWithTracing(
    ctx context.Context,
    query *GremlinQuery,
) (*Result, error) {
    // Create trace span
    tracer := otel.Tracer("query-executor")
    ctx, span := tracer.Start(ctx, "execute_gremlin_query")
    defer span.End()
// Add query attributes
span.SetAttributes(
attribute.String("query", query.Text),
attribute.Int64("estimated_vertices", query.EstimatedVertices),
)
    // Execute stages with child spans
    var result *Result
    for _, stage := range query.Stages {
        stageCtx, stageSpan := tracer.Start(ctx,
            fmt.Sprintf("stage_%d", stage.Index))

        result = qe.ExecuteStage(stageCtx, stage)

        stageSpan.SetAttributes(
            attribute.Int64("vertices_processed", result.Count),
            attribute.Int64("partitions_queried", int64(len(stage.Partitions))),
        )
        stageSpan.End()
    }

    return result, nil
}
### 10.4 Slow Query Log
slow_query_log:
  enabled: true
  threshold_seconds: 10
  destination: elasticsearch

  fields:
    - query_text
    - query_duration_ms
    - vertices_scanned
    - partitions_queried
    - principal_id
    - timestamp

example_entry:
  query: "g.V().out().out().out().count()"
  duration_ms: 45230
  vertices_scanned: 125000000
  partitions_queried: 8432
  principal: "user@company.com"
  timestamp: "2025-11-15T10:30:00Z"
  diagnosis: "Too many hops without filters"
### 10.5 Metrics and Alerts
prometheus_metrics:
  - name: query_latency_seconds
    type: histogram
    labels: [query_type, complexity, tenant_id]
    buckets: [0.001, 0.01, 0.1, 1, 10, 30, 60, 300]

  - name: query_timeouts_total
    type: counter
    labels: [query_type, reason, tenant_id]

  - name: circuit_breaker_state
    type: gauge
    labels: [cluster_id]

alerts:
  - name: HighQueryTimeoutRate
    condition: rate(query_timeouts_total[5m]) > 0.1
    severity: warning
    action: "Investigate slow queries"

  - name: CircuitBreakerOpen
    condition: circuit_breaker_state == 1
    severity: critical
    action: "Cluster instability detected"
## RFC-061: Graph Authorization

### Edit 14: Add Batch Authorization Optimization (Finding 10)

Location: Add new Section 7.5 after Section 7.4
Add:
## Section 7.5: Batch Authorization Optimization
### The Performance Problem
Per-vertex authorization check (10 μs) × 1M vertex query = **10 seconds overhead** (unacceptable).
// Naive approach (slow):
for _, vertex := range results {
    if !authz.CanAccessVertex(principal, vertex) {
        continue // Skip unauthorized
    }
    authorizedResults = append(authorizedResults, vertex)
}

// Cost: 1M vertices × 10 μs = 10 seconds
### Solution: Bitmap-Based Batch Authorization
type ClearanceBitmapCache struct {
    // principal_id → partition_id → authorized (bitmap)
    cache map[string]*roaring.Bitmap
}

func (cbc *ClearanceBitmapCache) GetAuthorizedPartitions(
    principalID string,
) *roaring.Bitmap {
    // Check cache
    if bitmap, exists := cbc.cache[principalID]; exists {
        return bitmap
    }
// Compute bitmap
principal := cbc.LoadPrincipal(principalID)
bitmap := roaring.New()
for _, partition := range cbc.GetAllPartitions() {
if cbc.IsPartitionAuthorized(partition, principal) {
bitmap.Add(uint32(partition.ID))
}
}
// Cache result
cbc.cache[principalID] = bitmap
return bitmap
}
func (qe *QueryExecutor) ExecuteWithBatchAuthz(
    principal *Principal,
    partitions []string,
) ([]*Vertex, error) {
    // Get authorized partitions bitmap (computed once)
    authorizedPartitions := qe.bitmapCache.GetAuthorizedPartitions(
        principal.ID)
// Filter partitions (fast bitmap check)
filteredPartitions := []string{}
for _, partitionID := range partitions {
    // Assumes numeric partition IDs; parse before the bitmap membership test
    idx, _ := strconv.ParseUint(partitionID, 10, 32)
    if authorizedPartitions.Contains(uint32(idx)) {
        filteredPartitions = append(filteredPartitions, partitionID)
    }
}
// Query only authorized partitions
return qe.QueryPartitions(filteredPartitions)
}
// Cost: 1 bitmap computation + 1M bit tests
// Time: 100 μs (compute) + 1M × 1 ns (test) = 1.1 ms
// Speedup: 10s → 1ms = 10,000× faster!
### Partition-Level Authorization Filter
func (qp *QueryPlanner) PrunePartitionsWithAuthz(
    filter *Filter,
    principal *Principal,
) []string {
    // Step 1: Index-based pruning
    candidatePartitions := qp.PrunePartitionsWithIndex(filter)
// Step 2: Authorization pruning (bitmap)
authorizedBitmap := qp.bitmapCache.GetAuthorizedPartitions(
principal.ID)
// Step 3: Intersect
authorizedPartitions := []string{}
for _, partitionID := range candidatePartitions {
    // Assumes numeric partition IDs; parse before the bitmap membership test
    idx, _ := strconv.ParseUint(partitionID, 10, 32)
    if authorizedBitmap.Contains(uint32(idx)) {
        authorizedPartitions = append(authorizedPartitions, partitionID)
    }
}
return authorizedPartitions
}
### Performance Comparison
| Query Size | Per-Vertex Authz | Bitmap Authz | Speedup |
|------------|------------------|--------------|---------|
| **1k vertices** | 10 ms | 1 ms | 10× |
| **10k vertices** | 100 ms | 1 ms | 100× |
| **100k vertices** | 1 s | 1.1 ms | 909× |
| **1M vertices** | 10 s | 1.1 ms | 9,090× |
### Cache Invalidation
func (cbc *ClearanceBitmapCache) InvalidateOnPrincipalUpdate(
    principalID string,
) {
    // Clear cached bitmap when principal clearances change
    delete(cbc.cache, principalID)
}

func (cbc *ClearanceBitmapCache) InvalidateOnPartitionLabelChange(
    partitionID string,
) {
    // Clear all cached bitmaps (expensive but rare)
    cbc.cache = make(map[string]*roaring.Bitmap)
}
**Key Insight**: At 100B scale, per-vertex authorization checks are infeasible. Bitmap-based batch authorization is **mandatory** for acceptable performance.
### Edit 15: Add Audit Log Sampling (Finding 14)

Location: Lines 863-870 (Audit Log Throughput)
Add After Table:
### Audit Log Sampling Strategy
**Problem**: 388 TB for 90 days of audit logs is excessive (and expensive).
**Solution**: Sample non-sensitive queries, always log sensitive/denied:
audit_sampling_config:
  # Always log (100%)
  always_log:
    - event_type: access_denied
    - event_type: sensitive_access
    - resource_labels: [pii, financial, secret]
    - principal_types: [admin, root, service_account]

  # Sample normal queries (1%)
  sample:
    rate: 0.01              # 1% of normal queries
    method: consistent_hash # Same query_id always sampled/not

    # Enhanced logging for sampled queries
    include_query_plan: true
    include_timing_breakdown: true
**Implementation**:
func (al *AuditLogger) ShouldLog(event *AuditEvent) bool {
    // Always log denials
    if event.Decision == ACCESS_DECISION_DENY {
        return true
    }
// Always log sensitive access
if al.IsSensitiveAccess(event) {
return true
}
// Always log privileged principals
if al.IsPrivilegedPrincipal(event.PrincipalID) {
return true
}
// Sample normal queries (1%)
hash := xxhash.Sum64String(event.QueryID)
return (hash % 100) < 1 // 1% sample rate
}
**Cost Savings**:
Original:          388 TB for 90 days
With 1% sampling:  3.88 TB (normal) + 10 TB (sensitive/denied) = 13.88 TB
Savings:           96% reduction

Storage cost: 13.88 TB × $23/TB = $319/month (vs $8,924 original)
**Trade-offs**:
- **Pro**: 96% cost savings, focus on important events
- **Con**: Can't reconstruct all query history (only sample)
- **Acceptable**: Compliance requirements met (all sensitive/denied logged)
## Summary
This document defines 15 specific edits across 5 RFCs to address the 18 findings from MEMO-050:
| RFC | Edits | Critical Fixes | High Priority | Medium Priority |
|---|---|---|---|---|
| RFC-057 | 5 edits | Network topology, Opaque IDs | Partition sizing, Hash function, Failure recovery | - |
| RFC-058 | 2 edits | Memory reconciliation | - | Index versioning |
| RFC-059 | 3 edits | S3 cost optimization | - | WAL replay, Hysteresis |
| RFC-060 | 3 edits | Query limits, Super-node handling | - | Observability |
| RFC-061 | 2 edits | - | Batch authorization | Audit sampling |
Estimated Effort: 15-20 engineer-days to implement all edits with proper testing and validation.
Priority Order:
- Week 1: Critical P0 edits (cost model, memory, limits)
- Week 2: High priority P1 edits (performance, reliability)
- Week 3: Medium priority P2 edits (operations, polish)
Next Steps:
- Review this summary with architecture team
- Assign edits to RFC owners
- Schedule RFC update sprint
- Update RFC status to "Draft - Revisions In Progress"
- Final review after all edits complete