
MEMO-081: Graph Implementation Readiness Assessment

Status: Draft
Author: Platform Team
Created: 2025-11-16
Updated: 2025-11-16

Executive Summary

This memo assesses the prism-data-layer's readiness to implement the massive-scale graph features defined in RFC-057 through RFC-061. After analyzing the five graph RFCs against the current prism-data-layer codebase, I've identified critical foundational gaps that must be addressed before graph implementation can proceed.

Bottom Line: The prism-data-layer currently lacks the fundamental infrastructure needed for 100B-scale graph operations. We need to close 13 foundational gaps across four phases before graph feature development can begin, plus two graph-specific gaps in a fifth phase.

Timeline: 16-20 weeks of foundational work before graph implementation can begin.

Risk: Attempting to implement graph features without these foundations will result in unbuildable prototypes and architectural debt.

RFC Requirements Summary

RFC-057: Massive-Scale Graph Sharding (100B Vertices)

Core Requirements:

  • Three-tier hierarchical sharding (cluster → proxy → partition)
  • Support for 1000+ lightweight compute nodes
  • 64 partitions per proxy (64,000 partitions cluster-wide)
  • Vertex ID format: {cluster_id}:{proxy_id}:{partition_id}:{local_vertex_id}
  • Opaque vertex ID routing tables for rebalancing flexibility
  • Dynamic partition rebalancing without downtime
  • 100M vertices per proxy, 1.56M vertices per partition

Scale Targets:

  • 100 billion vertices
  • 10 trillion edges
  • 30 TB total memory across cluster
  • Sub-second query latency for common traversals

RFC-058: Multi-Level Graph Indexing

Core Requirements:

  • Four-tier index hierarchy (partition → proxy → cluster → global)
  • Online index construction without blocking queries
  • Incremental index updates via distributed WAL
  • Bloom filter cascade for fast negative lookups
  • Index types: hash, range, inverted, edge (outgoing/incoming)
  • Index schema versioning and migration
  • Partition pruning using cluster-level indexes
  • Space efficiency: indexes <20% of graph data size

Performance Targets:

  • Sub-millisecond index lookups
  • Partition pruning: 99% reduction in queried partitions
  • 20,000× speedup for property-filtered queries

RFC-059: Hot/Cold Storage with S3 Snapshot Loading

Core Requirements:

  • Three-tier storage (hot: memory, warm: local SSD, cold: S3)
  • Hot tier: 10-20% of data in-memory
  • Cold tier: 80-90% on S3 Standard
  • Multi-format snapshot support (Parquet, Protobuf, JSON Lines, HDFS, Prometheus)
  • Parallel snapshot loading: 1000 workers, 210 TB in <20 minutes
  • Temperature classification (hot/warm/cold/frozen)
  • ML-based partition temperature prediction
  • WAL replay during snapshot load for consistency
  • Multi-tier caching (Varnish → CloudFront → S3 Express → Batch S3)

Cost Optimization:

  • 86% cost reduction vs pure in-memory ($7M/year vs $49M/year)
  • S3 request cost optimization (88.5% reduction via caching)

RFC-060: Distributed Gremlin Query Execution

Core Requirements:

  • Full Apache TinkerPop Gremlin 3.x specification support
  • Distributed query decomposition across 1000+ nodes
  • Cost-based query optimizer with selectivity estimation
  • Partition pruning via multi-level indexes
  • Adaptive parallelism based on intermediate result sizes
  • Result streaming without full materialization
  • Query resource limits (memory, timeout, fan-out, depth)
  • Super-node handling (100M+ degree vertices) with sampling
  • Batch authorization for large result sets
  • Distributed tracing and slow query logging

Performance Targets:

  • Sub-second latency for common traversals
  • 2,500× speedup via partition pruning
  • 10,000× speedup for batch authorization

RFC-061: Fine-Grained Graph Authorization

Core Requirements:

  • Vertex-level label-based access control (LBAC)
  • Security labels on vertices: ['org:acme', 'pii', 'confidential']
  • Principal clearances with hierarchical wildcard matching
  • Authorization policy push-down to partition level
  • Traversal filtering (auto-filter vertices during queries)
  • Batch authorization using Roaring bitmaps
  • Audit logging with intelligent sampling
  • Compliance support (SOC 2, GDPR, HIPAA)

Performance Targets:

  • <100 μs authorization overhead per vertex
  • 10,000× speedup for batch authorization (1M vertices in 1.1 ms)
  • 90% audit log cost reduction via sampling

Current prism-data-layer State Analysis

Existing Capabilities

What We Have:

  1. Basic Graph Pattern (RFC-055):

    • 1B vertex scale (not 100B)
    • 10 proxy architecture (not 1000)
    • In-memory MemStore backend
    • Simple consistent hashing
    • Basic vertex/edge CRUD operations
  2. Foundational Infrastructure:

    • gRPC-based proxy architecture
    • Protobuf message definitions
    • Basic admin protocol
    • Pattern registry system
    • Plugin architecture (RFC-008)
  3. Development Tooling:

    • Local testing infrastructure (RFC-016)
    • Documentation framework
    • CI/CD pipelines

Critical Gaps

I've identified 15 critical gaps across five phases. The 13 gaps in Phases 1-4 are prerequisites for graph implementation; the two gaps in Phase 5 cover the graph-specific features that build on those foundations:

Phase 1: Distributed Systems Foundations (4-6 weeks)

Gap 1.1: Hierarchical Partition Registry

What's Missing:

  • No three-tier partition model (cluster → proxy → partition)
  • Current consistent hashing is flat (no hierarchy)
  • No partition ID format: {cluster}:{proxy}:{partition}:{local_id}
  • No cluster/proxy/partition metadata storage

Required Implementation:

message ClusterTopology {
  string cluster_id = 1;
  repeated ProxyNode proxies = 2;
  PartitionStrategy strategy = 3;
}

message ProxyNode {
  string proxy_id = 1;
  string cluster_id = 2;
  repeated Partition partitions = 3;
  int64 vertex_count = 4;
  int64 memory_bytes = 5;
}

message Partition {
  string partition_id = 1;              // "07:0042:05"
  PartitionTemperature temperature = 2;
  int64 vertex_count = 3;
  string storage_location = 4;          // S3 path or "memory"
}
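
To make the ID scheme concrete, the sketch below shows one way the hierarchical format could be parsed in Go. It is illustrative only: the VertexID type, field names, and string-based representation are assumptions for this memo rather than the RFC-057 wire format, and a production parser would avoid the SplitN allocation to stay inside the <100 ns budget.

package main

import (
    "fmt"
    "strings"
)

// VertexID is an illustrative decomposition of the hierarchical format
// {cluster_id}:{proxy_id}:{partition_id}:{local_vertex_id}.
type VertexID struct {
    Cluster   string
    Proxy     string
    Partition string
    Local     string
}

// ParseVertexID splits a hierarchical vertex ID into its four components.
func ParseVertexID(id string) (VertexID, error) {
    parts := strings.SplitN(id, ":", 4)
    if len(parts) != 4 {
        return VertexID{}, fmt.Errorf("malformed vertex ID %q: want 4 colon-separated parts", id)
    }
    return VertexID{Cluster: parts[0], Proxy: parts[1], Partition: parts[2], Local: parts[3]}, nil
}

// PartitionKey returns the "{cluster}:{proxy}:{partition}" prefix used to
// address the owning partition in the registry (e.g. "07:0042:05").
func (v VertexID) PartitionKey() string {
    return v.Cluster + ":" + v.Proxy + ":" + v.Partition
}

func main() {
    v, err := ParseVertexID("07:0042:05:000000012345")
    if err != nil {
        panic(err)
    }
    fmt.Println(v.PartitionKey()) // prints "07:0042:05"
}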

Acceptance Criteria:

  • Protobuf definitions for cluster topology
  • In-memory partition registry with hierarchical lookup
  • Partition ID parsing: {cluster}:{proxy}:{partition}:{local_id}
  • Unit tests: 1000 proxies × 64 partitions = 64,000 partitions
  • Benchmark: <100ns partition ID parse time

Effort: 1-2 weeks

Gap 1.2: Distributed Routing Table for Opaque Vertex IDs

What's Missing:

  • No opaque vertex ID support (currently hierarchical only)
  • No distributed routing table (vertex ID → partition location)
  • No routing cache with 99% hit rate
  • No routing table sharding across cluster

Required Implementation:

type OpaqueVertexRouter struct {
    routingTable *DistributedHashTable // 256 shards
    bloomFilter  *bloom.BloomFilter    // Fast negative lookup
    cache        *RoutingCache         // 1M entries, 100 MB
}

func (ovr *OpaqueVertexRouter) RouteVertex(vertexID string) (*PartitionLocation, error)
func (ovr *OpaqueVertexRouter) UpdateLocation(vertexID string, location *PartitionLocation) error
func (ovr *OpaqueVertexRouter) InvalidateCache(vertexID string) error
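
As a rough sketch of the intended lookup path (cache, then Bloom filter, then the sharded routing table), the following assumes hypothetical routingCache, negativeFilter, and routingShards interfaces; none of these names exist in prism-data-layer today.

package routing

import "errors"

// PartitionLocation identifies the cluster/proxy/partition that owns a vertex.
type PartitionLocation struct {
    ClusterID   string
    ProxyID     string
    PartitionID string
}

// Hypothetical collaborators; the real prism types will differ.
type routingCache interface {
    Get(vertexID string) (*PartitionLocation, bool)
    Put(vertexID string, loc *PartitionLocation)
}

type negativeFilter interface {
    MayContain(vertexID string) bool // false means the ID was never registered
}

type routingShards interface {
    Lookup(vertexID string) (*PartitionLocation, error) // sharded remote lookup
}

var ErrVertexUnknown = errors.New("vertex ID not present in routing table")

type OpaqueVertexRouter struct {
    cache  routingCache
    bloom  negativeFilter
    shards routingShards
}

// RouteVertex resolves an opaque vertex ID to its partition location. The hot
// path is the local cache (<10 ns target); the Bloom filter short-circuits IDs
// that were never registered; only genuine misses hit the sharded routing
// table (<1 µs target) and are then cached for next time.
func (r *OpaqueVertexRouter) RouteVertex(vertexID string) (*PartitionLocation, error) {
    if loc, ok := r.cache.Get(vertexID); ok {
        return loc, nil
    }
    if !r.bloom.MayContain(vertexID) {
        return nil, ErrVertexUnknown
    }
    loc, err := r.shards.Lookup(vertexID)
    if err != nil {
        return nil, err
    }
    r.cache.Put(vertexID, loc)
    return loc, nil
}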

Acceptance Criteria:

  • Distributed routing table with consistent hashing
  • LRU cache with configurable size (default: 1M entries)
  • Bloom filter for negative lookups (0.1% false positive rate)
  • Routing table replication (3× for fault tolerance)
  • Unit tests: 100M vertex ID lookups
  • Benchmark: <1 μs cold lookup, <10 ns hot lookup (cached)

Effort: 2 weeks

Gap 1.3: Distributed Coordination Service

What's Missing:

  • No distributed coordination (no Raft/etcd integration)
  • No cluster membership management
  • No partition assignment consensus
  • No leader election for coordinators

Required Implementation:

  • Integrate etcd or implement Raft consensus
  • Cluster membership registry (proxies join/leave)
  • Partition assignment coordination
  • Leader election for query coordinators

Acceptance Criteria:

  • etcd client integration for cluster metadata
  • Cluster membership with health checks
  • Partition assignment API
  • Leader election for coordinators
  • Failure detection and failover (<5 second recovery)

Effort: 2-3 weeks

Gap 1.4: Distributed Write-Ahead Log (WAL)

What's Missing:

  • No distributed WAL (RFC-051 not implemented)
  • No Kafka integration for durable writes
  • No WAL consumer for multi-tier updates
  • No WAL replay for consistency during snapshot load

Required Implementation:

type WALProducer struct {
    kafkaProducer *kafka.Producer
    topic         string // "graph-wal-cluster-{cluster_id}"
    partitions    int    // 1000 (one per proxy)
}

type WALConsumer struct {
    kafkaConsumer *kafka.Consumer
    groupID       string
}

type WALOperation struct {
    Type        OperationType // CREATE_VERTEX, UPDATE_VERTEX, DELETE_VERTEX, etc.
    PartitionID string
    Vertex      *Vertex
    Edge        *Edge
    Timestamp   int64
}
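
The sketch below illustrates one possible keying scheme: every WAL operation is published with its graph partition ID as the Kafka message key, so all operations for a partition land in the same Kafka partition and replay in order. The kafkaWriter interface and JSON encoding are stand-ins for whichever Kafka client and protobuf serialization we actually adopt.

package wal

import (
    "encoding/json"
    "hash/fnv"
)

type OperationType int32

const (
    CreateVertex OperationType = iota
    UpdateVertex
    DeleteVertex
    CreateEdge
    DeleteEdge
)

type WALOperation struct {
    Type        OperationType
    PartitionID string
    Payload     []byte // serialized vertex/edge
    Timestamp   int64
}

// kafkaWriter abstracts the concrete client (confluent-kafka-go, franz-go,
// sarama, ...); only the topic/key/value shape matters for this sketch.
type kafkaWriter interface {
    Write(topic string, key, value []byte) error
}

type WALProducer struct {
    writer kafkaWriter
    topic  string // e.g. "graph-wal-cluster-07"
}

// Append publishes an operation keyed by its graph partition ID. Keying by
// partition ID gives per-partition ordering in Kafka, which WAL replay during
// snapshot load depends on. JSON here stands in for the protobuf encoding.
func (p *WALProducer) Append(op WALOperation) error {
    value, err := json.Marshal(op)
    if err != nil {
        return err
    }
    return p.writer.Write(p.topic, []byte(op.PartitionID), value)
}

// partitionFor shows how a key-hash partitioner maps the same graph partition
// ID to the same Kafka partition; real clients apply their own hash, so this
// is only illustrative.
func partitionFor(partitionID string, numKafkaPartitions int) int32 {
    h := fnv.New32a()
    h.Write([]byte(partitionID))
    return int32(h.Sum32() % uint32(numKafkaPartitions))
}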

Acceptance Criteria:

  • Kafka producer for WAL writes
  • Kafka consumer for WAL replay
  • WAL operation protobuf definitions
  • WAL consumer with offset tracking
  • WAL replay during partition load
  • Unit tests: 10k ops/sec per partition
  • Benchmark: <5 ms end-to-end WAL write latency

Effort: 2 weeks

Phase 2: Storage Tier Infrastructure (4-5 weeks)

Gap 2.1: Hot/Warm/Cold Storage Tier Management

What's Missing:

  • No storage tier abstraction (only MemStore)
  • No S3 integration for cold storage
  • No local SSD management for warm storage
  • No temperature classification for partitions
  • No promotion/demotion policies

Required Implementation:

type PartitionManager struct {
    hotTier    *MemStore
    warmTier   *LocalSSD
    coldTier   *S3Backend
    classifier *TemperatureClassifier
}

func (pm *PartitionManager) GetVertex(vertexID string) (*Vertex, error) {
    // Try hot → warm → cold
}

func (pm *PartitionManager) PromotePartition(partitionID string, targetTemp PartitionTemperature) error
func (pm *PartitionManager) DemotePartition(partitionID string, targetTemp PartitionTemperature) error
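
A minimal sketch of the GetVertex fallback chain, assuming a hypothetical tier interface over MemStore, LocalSSD, and S3Backend (field names simplified; promotion on miss is omitted):

package storage

import "errors"

var ErrNotFound = errors.New("vertex not found in this tier")

type Vertex struct {
    ID         string
    Properties map[string]string
}

// tier is an illustrative abstraction over MemStore, LocalSSD, and S3Backend.
type tier interface {
    GetVertex(vertexID string) (*Vertex, error)
}

type PartitionManager struct {
    hot, warm, cold tier
}

// GetVertex walks hot → warm → cold and returns from the first tier that has
// the vertex. A full implementation would also promote the vertex (or its
// whole partition) per the hysteresis policy; that is omitted here.
func (pm *PartitionManager) GetVertex(vertexID string) (*Vertex, error) {
    for _, t := range []tier{pm.hot, pm.warm, pm.cold} {
        v, err := t.GetVertex(vertexID)
        if err == nil {
            return v, nil
        }
        if !errors.Is(err, ErrNotFound) {
            return nil, err // real failure, not just a miss in this tier
        }
    }
    return nil, ErrNotFound
}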

Acceptance Criteria:

  • Storage tier interface (hot/warm/cold)
  • S3 backend client (read/write partitions)
  • Local SSD backend (NVMe-optimized)
  • Tier-aware vertex/edge lookup with fallback
  • Temperature classification algorithm (ML-based or rule-based)
  • Promotion/demotion policies with hysteresis
  • Unit tests: Tier fallback chain
  • Benchmark: <50 μs hot, <100 μs warm, <50 ms cold

Effort: 2-3 weeks

Gap 2.2: Snapshot Loading Infrastructure

What's Missing:

  • No bulk snapshot loading (only incremental inserts)
  • No Parquet/Protobuf/JSON Lines readers
  • No parallel loading across 1000 workers
  • No WAL replay during snapshot load
  • No dual-version loading (active + loading)

Required Implementation:

type SnapshotLoader struct {
    loaders map[SnapshotFormat]FormatLoader
}

type FormatLoader interface {
    LoadVertices(s3Path string, partition *Partition) error
    LoadEdges(s3Path string, partition *Partition) error
}

// Implementations:
// - ParquetLoader (Apache Arrow)
// - ProtobufLoader (zero-copy mmap)
// - JSONLinesLoader
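
For the parallel loading coordinator, a bounded fan-out like the sketch below (using golang.org/x/sync/errgroup) is the expected shape; the LoadAll function, partition list, and S3 path layout are illustrative assumptions, and FormatLoader is repeated here only for self-containment.

package snapshot

import (
    "context"
    "fmt"

    "golang.org/x/sync/errgroup"
)

type Partition struct{ ID string }

// FormatLoader mirrors the interface declared above.
type FormatLoader interface {
    LoadVertices(s3Path string, partition *Partition) error
    LoadEdges(s3Path string, partition *Partition) error
}

// LoadAll fans partition loads out to at most `workers` goroutines and fails
// fast on the first error. At cluster scale the 1000 workers are spread across
// proxies; this sketch only shows the per-coordinator shape.
func LoadAll(ctx context.Context, loader FormatLoader, parts []*Partition, s3Prefix string, workers int) error {
    g, ctx := errgroup.WithContext(ctx)
    g.SetLimit(workers)
    for _, p := range parts {
        p := p // capture for the goroutine
        g.Go(func() error {
            if err := ctx.Err(); err != nil {
                return err // another partition already failed
            }
            vPath := fmt.Sprintf("%s/%s/vertices.parquet", s3Prefix, p.ID)
            ePath := fmt.Sprintf("%s/%s/edges.parquet", s3Prefix, p.ID)
            if err := loader.LoadVertices(vPath, p); err != nil {
                return fmt.Errorf("partition %s vertices: %w", p.ID, err)
            }
            return loader.LoadEdges(ePath, p)
        })
    }
    return g.Wait()
}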

Acceptance Criteria:

  • Parquet loader using Apache Arrow
  • Protobuf loader with memory-mapped files
  • JSON Lines loader
  • Parallel loading coordinator (1000 workers)
  • WAL replay integration for consistency
  • Dual-version partition loading (active + loading)
  • Atomic swap after load complete
  • Unit tests: Load 1M vertex snapshot
  • Benchmark: <20 minutes for 210 TB (1000 workers)

Effort: 3 weeks

Gap 2.3: S3 Multi-Tier Caching

What's Missing:

  • No S3 request cost optimization
  • No CloudFront CDN integration
  • No proxy-local Varnish cache
  • No S3 Express One Zone tier for frequently accessed cold partitions
  • No batch S3 reads

Required Implementation:

type CacheManager struct {
    localCache *VarnishCache     // Tier 0: 100 GB SSD per proxy
    cdnCache   *CloudFrontCache  // Tier 1: Global CDN
    s3Express  *S3ExpressClient  // Tier 2: Frequently accessed cold partitions
    s3Standard *S3StandardClient // Tier 3: Batch reads
}

func (cm *CacheManager) FetchPartition(partitionID string) (*Partition, error) {
    // Try local → CDN → S3 Express → S3 Standard
}
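
The fallback could be structured as a read-through lookup that also backfills the faster tiers on a hit, which is what keeps the hit rate near the 72% target. The sketch below simplifies by giving every tier an explicit Put; in reality Varnish and CloudFront fill on pull-through, so only the shape of the control flow carries over, and all names here are illustrative.

package cache

import "errors"

var ErrCacheMiss = errors.New("partition not in this cache tier")

type PartitionBlob []byte // serialized partition snapshot

type cacheTier interface {
    Get(partitionID string) (PartitionBlob, error)
    Put(partitionID string, blob PartitionBlob) error
}

type originStore interface { // S3 Standard, possibly via batched reads
    Fetch(partitionID string) (PartitionBlob, error)
}

type CacheManager struct {
    tiers  []cacheTier // fastest first: local Varnish, CloudFront, S3 Express
    origin originStore
}

// FetchPartition is a read-through lookup: try each tier in order, and on a
// hit backfill every faster tier so the next request is cheaper.
func (cm *CacheManager) FetchPartition(partitionID string) (PartitionBlob, error) {
    for i, t := range cm.tiers {
        blob, err := t.Get(partitionID)
        if err == nil {
            cm.backfill(partitionID, blob, i)
            return blob, nil
        }
        if !errors.Is(err, ErrCacheMiss) {
            return nil, err
        }
    }
    blob, err := cm.origin.Fetch(partitionID)
    if err != nil {
        return nil, err
    }
    cm.backfill(partitionID, blob, len(cm.tiers))
    return blob, nil
}

// backfill populates every tier faster than the one that answered.
func (cm *CacheManager) backfill(partitionID string, blob PartitionBlob, missedBelow int) {
    for _, t := range cm.tiers[:missedBelow] {
        _ = t.Put(partitionID, blob) // best-effort; a failed fill is not fatal
    }
}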

Acceptance Criteria:

  • Varnish HTTP cache integration (per-proxy)
  • CloudFront CDN configuration
  • S3 Express One Zone client
  • Batch S3 read prefetching (10 partitions per request)
  • Cache invalidation on partition update
  • Unit tests: Cache tier fallback
  • Benchmark: 72% cache hit rate (avoid S3)

Effort: 2 weeks

Phase 3: Graph Query Infrastructure (4-5 weeks)

Gap 3.1: Multi-Level Index Infrastructure

What's Missing:

  • No index abstraction layer
  • No partition-level indexes (hash, range, inverted, edge)
  • No proxy-level aggregated indexes
  • No cluster-level directory indexes
  • No Bloom filter cascade
  • No index schema versioning

Required Implementation:

type PartitionIndex struct {
    HashIndexes     map[string]*PropertyHashIndex
    RangeIndexes    map[string]*PropertyRangeIndex
    InvertedIndexes map[string]*InvertedIndex
    EdgeIndexes     map[string]*EdgeIndex
    BloomFilter     *bloom.BloomFilter
    SchemaVersion   int32 // Current: 5
}

type IndexBuilder struct {
    partition *Partition
}

func (ib *IndexBuilder) BuildOnline() error // Online index construction
func (ib *IndexBuilder) UpdateIncremental(walOp *WALOperation) error
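
As one example of the index types above, a bucketed range index backed by Roaring bitmaps (github.com/RoaringBitmap/roaring) could look like the sketch below; the bucket scheme and type names are illustrative, and bucket math assumes non-negative values for brevity.

package index

import "github.com/RoaringBitmap/roaring"

// PropertyRangeIndex buckets a numeric property (e.g. "age") and keeps a
// Roaring bitmap of local vertex IDs per bucket. A range query ORs the bitmaps
// of every bucket overlapping [lo, hi]; boundary buckets would still need a
// re-check against the stored values, omitted here.
type PropertyRangeIndex struct {
    bucketWidth int64
    buckets     map[int64]*roaring.Bitmap // bucket start -> vertex IDs
}

func NewPropertyRangeIndex(bucketWidth int64) *PropertyRangeIndex {
    return &PropertyRangeIndex{
        bucketWidth: bucketWidth,
        buckets:     make(map[int64]*roaring.Bitmap),
    }
}

// Add records that localVertexID has the given property value.
func (ix *PropertyRangeIndex) Add(localVertexID uint32, value int64) {
    b := (value / ix.bucketWidth) * ix.bucketWidth
    bm, ok := ix.buckets[b]
    if !ok {
        bm = roaring.New()
        ix.buckets[b] = bm
    }
    bm.Add(localVertexID)
}

// Lookup returns candidate vertex IDs whose value may fall in [lo, hi].
func (ix *PropertyRangeIndex) Lookup(lo, hi int64) *roaring.Bitmap {
    out := roaring.New()
    for b := (lo / ix.bucketWidth) * ix.bucketWidth; b <= hi; b += ix.bucketWidth {
        if bm, ok := ix.buckets[b]; ok {
            out.Or(bm)
        }
    }
    return out
}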

Acceptance Criteria:

  • Partition index protobuf definitions
  • Hash index (exact match): O(1) lookup
  • Range index (buckets + Roaring bitmaps): O(log N) lookup
  • Inverted index (property value → vertex IDs)
  • Edge index (outgoing/incoming adjacency lists)
  • Bloom filter for negative lookups
  • Index schema versioning (v1 → v5 migration)
  • Online index building without blocking queries
  • Incremental WAL-based index updates
  • Unit tests: 1M vertex index construction
  • Benchmark: <1 ms index lookup, <10 GB index size per partition

Effort: 3 weeks

Gap 3.2: Gremlin Query Parser and Planner

What's Missing:

  • No Gremlin AST parser
  • No query decomposition (no stages)
  • No cost-based optimizer
  • No partition pruning strategy
  • No selectivity estimation

Required Implementation:

type GremlinParser struct {
    // Parse Gremlin query string to AST
}

type QueryPlanner struct {
    indexManager *IndexManager
    costModel    *CostModel
}

func (qp *QueryPlanner) Parse(gremlinQuery string) (*QueryAST, error)
func (qp *QueryPlanner) Analyze(ast *QueryAST) (*QueryPlan, error)
func (qp *QueryPlanner) Optimize(plan *QueryPlan) (*QueryPlan, error)
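
The core of the cost-based optimization is simple to state: estimate each filter's selectivity from index statistics and run the most selective filter first. A minimal sketch, with a hypothetical Statistics interface standing in for the real index metadata:

package planner

import "sort"

// FilterStep is a simplified stand-in for a has()/hasLabel() step in the AST.
type FilterStep struct {
    Property string
    Value    string
}

// Statistics would be backed by partition/proxy index metadata.
type Statistics interface {
    // Selectivity estimates the fraction of vertices matching the filter (0..1).
    Selectivity(f FilterStep) float64
}

// OrderFilters sorts filters so the most selective one (smallest estimated
// matching fraction) runs first, shrinking intermediate results before the
// broader filters are applied.
func OrderFilters(filters []FilterStep, stats Statistics) []FilterStep {
    out := make([]FilterStep, len(filters))
    copy(out, filters)
    sort.SliceStable(out, func(i, j int) bool {
        return stats.Selectivity(out[i]) < stats.Selectivity(out[j])
    })
    return out
}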

Acceptance Criteria:

  • Gremlin query string parser (Apache TinkerPop subset)
  • Query AST representation (V, E, hasLabel, has, out, in, limit, etc.)
  • Query decomposition into stages
  • Cost-based optimizer with selectivity estimation
  • Partition pruning using multi-level indexes
  • Filter ordering optimization (most selective first)
  • Unit tests: Parse 100 common Gremlin queries
  • Benchmark: <10 ms query planning

Effort: 3 weeks

Gap 3.3: Distributed Query Execution Engine

What's Missing:

  • No distributed query coordinator
  • No partition executor
  • No cross-partition traversal coordinator
  • No result streaming
  • No adaptive parallelism
  • No query resource limits

Required Implementation:

type QueryCoordinator struct {
    planner      *QueryPlanner
    partitionMgr *PartitionManager
}

type PartitionExecutor struct {
    partition *Partition
    indexes   *PartitionIndex
}

func (qc *QueryCoordinator) ExecuteGremlin(
    ctx context.Context,
    gremlinQuery string,
    principal *Principal,
) (*ResultStream, error)

func (pe *PartitionExecutor) ExecuteStep(
    ctx context.Context,
    step *GremlinStep,
    inputVertices []*Vertex,
) ([]*Vertex, error)
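
One plausible shape for the fan-out with bounded parallelism and streaming results is sketched below; the partitionClient interface and channel-based ResultStream are assumptions, not existing prism APIs.

package exec

import (
    "context"

    "golang.org/x/sync/errgroup"
)

type Vertex struct{ ID string }

// partitionClient abstracts the per-partition executor RPC.
type partitionClient interface {
    ExecuteStep(ctx context.Context, partitionID, step string, inputs []*Vertex) ([]*Vertex, error)
}

// ResultStream delivers vertices as they arrive; Err is valid once Results closes.
type ResultStream struct {
    Results <-chan *Vertex
    Err     func() error
}

// fanOut runs one traversal step on every target partition with at most
// maxParallel executions in flight and streams vertices back over a bounded
// channel, so a slow consumer naturally applies backpressure to the executors.
func fanOut(ctx context.Context, client partitionClient, partitions []string, step string, inputs []*Vertex, maxParallel int) *ResultStream {
    out := make(chan *Vertex, 1024)
    g, ctx := errgroup.WithContext(ctx)
    g.SetLimit(maxParallel)
    for _, pid := range partitions {
        pid := pid
        g.Go(func() error {
            vs, err := client.ExecuteStep(ctx, pid, step, inputs)
            if err != nil {
                return err
            }
            for _, v := range vs {
                select {
                case out <- v:
                case <-ctx.Done():
                    return ctx.Err()
                }
            }
            return nil
        })
    }
    var finalErr error
    go func() {
        finalErr = g.Wait()
        close(out)
    }()
    return &ResultStream{Results: out, Err: func() error { return finalErr }}
}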

Acceptance Criteria:

  • Query coordinator (fan-out to partitions)
  • Partition executor (local query execution)
  • Cross-partition traversal coordinator
  • Result streaming with backpressure
  • Adaptive parallelism (1-1000 workers)
  • Query resource limits (memory, timeout, fan-out, depth)
  • Circuit breaker for runaway queries
  • Unit tests: Execute 2-hop traversal across 100 partitions
  • Benchmark: <1 second for 1M vertex query

Effort: 3 weeks

Phase 4: Authorization and Observability (3-4 weeks)

Gap 4.1: Label-Based Access Control (LBAC)

What's Missing:

  • No security label support on vertices/edges
  • No principal clearance model
  • No authorization policy engine
  • No batch authorization with bitmaps
  • No traversal filtering

Required Implementation:

type AuthorizationManager struct {
    policy      *AuthorizationPolicy
    bitmapCache *AuthBitmapCache
    auditLogger *AuditLogger
}

func (am *AuthorizationManager) CanAccessVertex(
    ctx *AuthorizationContext,
    vertex *Vertex,
) (bool, error)

func (am *AuthorizationManager) FilterVerticesBatch(
    principal *Principal,
    vertices []*Vertex,
) ([]*Vertex, error) // Batch authorization with bitmaps
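
The batch path reduces to bitmap set algebra: given a per-partition cache that maps each security label to a Roaring bitmap of the local vertex IDs carrying it, the visible set is "all vertices" minus "vertices with at least one uncleared label". The sketch below collapses hierarchical wildcard matching (org:acme:**) into a plain clearance map for brevity; all names are illustrative.

package authz

import "github.com/RoaringBitmap/roaring"

// labelBitmaps maps a security label to the bitmap of local vertex IDs that
// carry it; in prism this would live in the partition-level bitmap cache.
type labelBitmaps map[string]*roaring.Bitmap

// authorizedSet computes which vertices a principal may see. A vertex is
// visible only if every one of its labels is cleared, so we block the union
// of "vertices carrying an uncleared label" and subtract it from the full set.
func authorizedSet(all *roaring.Bitmap, labels labelBitmaps, cleared map[string]bool) *roaring.Bitmap {
    blocked := roaring.New()
    for label, withLabel := range labels {
        if !cleared[label] {
            blocked.Or(withLabel)
        }
    }
    visible := all.Clone()
    visible.AndNot(blocked)
    return visible
}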

Acceptance Criteria:

  • Security label protobuf definitions
  • Principal clearance model
  • Authorization policy evaluation (<100 μs per vertex)
  • Hierarchical wildcard matching (org:acme:**)
  • Batch authorization with Roaring bitmaps
  • Partition-level authorization bitmap cache
  • Traversal filtering (auto-inject authz filters)
  • Unit tests: Authorize 1M vertices
  • Benchmark: 1.1 ms for 1M vertex batch authorization

Effort: 2 weeks

Gap 4.2: Audit Logging Infrastructure

What's Missing:

  • No audit event definitions
  • No Kafka-based audit log sink
  • No intelligent sampling (1% base, 100% sensitive)
  • No compliance query support
  • No audit log retention policies

Required Implementation:

type AuditLogger struct {
    kafkaProducer *kafka.Producer
    sampler       *ComplianceSampler
}

type ComplianceSampler struct {
    baseSampleRate float64  // 0.01 = 1%
    alwaysLog      []string // "pii", "confidential", "financial"
}

func (al *AuditLogger) LogAccess(ctx *AuthorizationContext, vertex *Vertex)
func (al *AuditLogger) LogDenial(ctx *AuthorizationContext, vertex *Vertex, reason string)
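
A minimal sampling decision, assuming a hypothetical ShouldLog method: accesses touching a sensitive label are always logged, everything else at the base rate.

package audit

import "math/rand"

type ComplianceSampler struct {
    baseSampleRate float64         // 0.01 = 1%
    alwaysLog      map[string]bool // "pii", "confidential", "financial"
}

// ShouldLog returns true for every access that touches a sensitive label and
// for a baseSampleRate fraction of everything else.
func (s *ComplianceSampler) ShouldLog(vertexLabels []string) bool {
    for _, l := range vertexLabels {
        if s.alwaysLog[l] {
            return true
        }
    }
    return rand.Float64() < s.baseSampleRate
}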

Acceptance Criteria:

  • Audit event protobuf definitions
  • Kafka producer for audit events
  • Intelligent sampling (1% base, 100% sensitive)
  • Compliance rules (SOC 2, GDPR, HIPAA)
  • Audit log retention (hot: 7d, warm: 30d, cold: 90d)
  • Unit tests: 10M events/sec sampling
  • Benchmark: <10 μs audit log overhead

Effort: 1 week

Gap 4.3: Distributed Tracing and Observability

What's Missing:

  • No OpenTelemetry integration
  • No distributed tracing across 1000 nodes
  • No slow query logging
  • No query timeline visualization
  • No EXPLAIN plan for queries

Required Implementation:

type QueryTracer struct {
    tracer trace.Tracer // OpenTelemetry
}

func (qt *QueryTracer) TraceQueryExecution(
    ctx context.Context,
    query string,
) (*QueryTimeline, error)

type QueryTimeline struct {
    QueryID     string
    TotalTime   time.Duration
    Stages      []*StageTimeline
    Bottlenecks []Bottleneck
}
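
A minimal sketch using the OpenTelemetry Go API (go.opentelemetry.io/otel); provider and exporter wiring to SigNoz/Jaeger is omitted, and the WithQuerySpan helper is an illustrative name, not an existing prism function.

package tracing

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/trace"
)

type QueryTracer struct {
    tracer trace.Tracer
}

func NewQueryTracer() *QueryTracer {
    // Provider/exporter wiring (OTLP -> SigNoz/Jaeger) is configured elsewhere.
    return &QueryTracer{tracer: otel.Tracer("prism.graph.query")}
}

// WithQuerySpan wraps one query execution in a span. Per-stage and
// per-partition child spans would be started the same way inside the
// coordinator and partition executors, which is what yields the timeline.
func (qt *QueryTracer) WithQuerySpan(ctx context.Context, queryID, gremlin string, fn func(ctx context.Context) error) error {
    ctx, span := qt.tracer.Start(ctx, "gremlin.execute",
        trace.WithAttributes(
            attribute.String("query.id", queryID),
            attribute.String("query.gremlin", gremlin),
        ))
    defer span.End()
    err := fn(ctx)
    if err != nil {
        span.RecordError(err)
    }
    return err
}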

Acceptance Criteria:

  • OpenTelemetry SDK integration
  • Distributed trace context propagation
  • Query timeline capture (per stage, per partition)
  • Slow query logging (threshold: 5 seconds)
  • EXPLAIN plan implementation
  • SignOz/Jaeger trace visualization
  • Prometheus metrics (query latency, partition execution time)
  • Unit tests: Trace 3-hop query across 100 partitions
  • Benchmark: <100 μs tracing overhead

Effort: 2 weeks

Phase 5: Graph-Specific Features (4-6 weeks, after foundations)

Gap 5.1: Super-Node Handling

What's Missing:

  • No vertex degree classification (normal/hub/super/mega)
  • No sampling strategies (random, top-K, HyperLogLog)
  • No circuit breaker for high-degree vertices
  • No Gremlin .sample() and .approximate() extensions

Required Implementation:

type SuperNodeCircuitBreaker struct {
    DegreeThreshold   int64 // 100k neighbors = super-node
    AutoSample        bool
    DefaultSampleSize int   // 10k samples
}

func (scb *SuperNodeCircuitBreaker) CheckDegree(vertexID string) (int64, VertexClass)
func (scb *SuperNodeCircuitBreaker) SampleNeighbors(vertexID string, size int) ([]*Vertex, error)
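
The random sampling strategy can be a single-pass reservoir sample, which keeps memory at O(k) even for a 100M-neighbor super-node. A sketch, assuming a hypothetical neighborIterator that streams adjacency entries:

package supernode

import "math/rand"

// neighborIterator streams neighbor vertex IDs without materializing the full
// adjacency list (which may hold 100M+ entries for a super-node).
type neighborIterator interface {
    Next() (string, bool) // second value is false when exhausted
}

// SampleNeighbors returns a uniform random sample of up to k neighbors using
// reservoir sampling: the i-th neighbor replaces a random reservoir slot with
// probability k/i, so a single pass and O(k) memory suffice.
func SampleNeighbors(it neighborIterator, k int) []string {
    reservoir := make([]string, 0, k)
    seen := 0
    for {
        id, ok := it.Next()
        if !ok {
            return reservoir
        }
        seen++
        if len(reservoir) < k {
            reservoir = append(reservoir, id)
            continue
        }
        if j := rand.Intn(seen); j < k {
            reservoir[j] = id
        }
    }
}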

Acceptance Criteria:

  • Vertex degree classification (based on out-degree)
  • Random sampling strategy (reservoir sampling)
  • Top-K sampling (by property: engagement, recency)
  • HyperLogLog cardinality estimation (0.8% error)
  • Circuit breaker for queries to super-nodes
  • Gremlin .sample(N) extension
  • Gremlin .approximate() extension
  • Unit tests: Sample 100M neighbors to 10k
  • Benchmark: <100 ms super-node sampling

Effort: 2 weeks

Gap 5.2: Dynamic Partition Rebalancing

What's Missing:

  • No partition migration without downtime
  • No dual-read during migration
  • No rebalancing coordinator
  • No partition copy/verify/swap

Required Implementation:

type RebalancingCoordinator struct {
    sourceProxy string
    targetProxy string
}

func (rc *RebalancingCoordinator) MigratePartition(
    partitionID string,
    targetProxy string,
) error {
    // 1. Copy partition to target
    // 2. Dual-read (source + target)
    // 3. Verify consistency
    // 4. Update routing table
    // 5. Delete source
}
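
Steps 3-4 are the delicate part; the sketch below shows one way to verify consistency via checksums over a canonical partition dump and then flip ownership atomically. All interfaces here are hypothetical.

package rebalance

import (
    "bytes"
    "crypto/sha256"
    "fmt"
)

// partitionReader exposes a stable, ordered dump of a partition's contents so
// identical data on two proxies hashes to the same digest.
type partitionReader interface {
    CanonicalDump(partitionID string) ([]byte, error)
}

type routingTable interface {
    // SwapOwner atomically repoints a partition to a new proxy: readers see
    // either the old owner or the new one, never neither.
    SwapOwner(partitionID, fromProxy, toProxy string) error
}

// verifyAndSwap covers steps 3-4 of the migration: compare checksums while the
// dual-read window is still open, then atomically flip the routing table.
func verifyAndSwap(partitionID, fromProxy, toProxy string, src, dst partitionReader, rt routingTable) error {
    a, err := src.CanonicalDump(partitionID)
    if err != nil {
        return err
    }
    b, err := dst.CanonicalDump(partitionID)
    if err != nil {
        return err
    }
    ha, hb := sha256.Sum256(a), sha256.Sum256(b)
    if !bytes.Equal(ha[:], hb[:]) {
        return fmt.Errorf("partition %s: checksum mismatch after copy, aborting swap", partitionID)
    }
    return rt.SwapOwner(partitionID, fromProxy, toProxy)
}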

Acceptance Criteria:

  • Partition copy (source → target)
  • Dual-read during migration
  • Consistency verification (checksum)
  • Routing table update (atomic swap)
  • Source partition cleanup
  • Zero downtime guarantee
  • Unit tests: Migrate 64 partitions
  • Benchmark: <10 seconds per partition (opaque IDs)

Effort: 2 weeks

Implementation Roadmap

Phase 1: Foundations (4-6 weeks)

Deliverables:

  • Hierarchical partition registry
  • Distributed routing table
  • Distributed coordination (etcd)
  • Distributed WAL (Kafka)

Milestone: Can route queries to correct partition using hierarchical IDs

Phase 2: Storage (4-5 weeks)

Deliverables:

  • Hot/warm/cold storage tier management
  • Snapshot loading infrastructure
  • S3 multi-tier caching

Milestone: Can load 1M vertex snapshot from S3 and serve queries from hot/cold tiers

Phase 3: Query Engine (4-5 weeks)

Deliverables:

  • Multi-level index infrastructure
  • Gremlin query parser and planner
  • Distributed query execution engine

Milestone: Can execute Gremlin queries across 100 partitions with partition pruning

Phase 4: Security and Observability (3-4 weeks)

Deliverables:

  • Label-based access control
  • Audit logging infrastructure
  • Distributed tracing

Milestone: Can authorize queries with vertex-level labels and audit access

Phase 5: Graph Features (4-6 weeks)

Deliverables:

  • Super-node handling
  • Dynamic partition rebalancing

Milestone: Can handle celebrity vertices (100M degree) and rebalance partitions without downtime

Total Timeline: 19-26 weeks (4.75-6.5 months)

Risk Assessment

High-Risk Dependencies

Risk 1: Distributed Coordination Complexity

  • Likelihood: High
  • Impact: Critical
  • Mitigation: Use etcd (proven) instead of building Raft from scratch
  • Fallback: Start with single-coordinator (not HA) for prototyping

Risk 2: WAL Replay Consistency

  • Likelihood: Medium
  • Impact: Critical
  • Mitigation: Comprehensive WAL replay tests with simulated failures
  • Fallback: Disable snapshot loading until WAL replay is proven

Risk 3: S3 Request Cost Explosion

  • Likelihood: High (identified in RFC-059)
  • Impact: High ($1M/year without mitigation)
  • Mitigation: Implement multi-tier caching in Phase 2 (mandatory)
  • Fallback: N/A - caching is non-negotiable at 100B scale

Risk 4: Index Size Explosion

  • Likelihood: Medium
  • Impact: High (16 TB indexes cluster-wide)
  • Mitigation: Roaring bitmap compression, Bloom filter optimization
  • Fallback: Reduce indexed properties to top 10 most queried

Risk 5: Query Resource Exhaustion

  • Likelihood: High (celebrity super-nodes)
  • Impact: Critical (cluster-wide OOM)
  • Mitigation: Circuit breaker + sampling (Phase 5, but prototype in Phase 3)
  • Fallback: Hard-code .limit(10000) on all traversals during development

Medium-Risk Dependencies

Risk 6: Gremlin Compatibility

  • Likelihood: Medium
  • Impact: Medium
  • Mitigation: Implement TinkerPop test suite subset
  • Fallback: Document unsupported Gremlin steps

Risk 7: Authorization Performance

  • Likelihood: Low (bitmap approach proven in RFC-061)
  • Impact: Medium
  • Mitigation: Benchmark early in Phase 4
  • Fallback: Reduce clearance check frequency (sample 10% of queries)

Testing Strategy

Unit Test Coverage Targets

Phase 1: 90% coverage (foundations must be solid)

  • Partition registry: 95%
  • Routing table: 90%
  • WAL producer/consumer: 95%

Phase 2: 85% coverage

  • Storage tier management: 85%
  • Snapshot loaders: 80%

Phase 3: 80% coverage

  • Index construction: 85%
  • Query planner: 80%
  • Query executor: 75%

Phase 4: 85% coverage

  • Authorization: 90%
  • Audit logging: 80%

Phase 5: 75% coverage

  • Super-node handling: 75%
  • Rebalancing: 80%

Integration Test Requirements

Critical Integration Tests:

  1. End-to-End Query Test (Phase 3 milestone):

    • Load 1M vertex graph
    • Execute Gremlin query: g.V().has('city', 'SF').out('FOLLOWS').limit(100)
    • Verify: Query completes in <1 second
    • Verify: Correct 100 results returned
  2. Snapshot Load + WAL Replay Test (Phase 2 milestone):

    • Load 1M vertex snapshot
    • Inject 10k WAL operations during load
    • Verify: All WAL operations applied to loaded snapshot
    • Verify: No data loss
  3. Authorization + Query Test (Phase 4 milestone):

    • Create graph with 3 organizations (org:acme, org:widget, org:globex)
    • Principal with clearance: ['org:acme']
    • Execute query: g.V().hasLabel('User')
    • Verify: Only org:acme users returned
  4. Super-Node Handling Test (Phase 5 milestone):

    • Create vertex with 100M neighbors
    • Execute query: g.V('celebrity').out('FOLLOWS')
    • Verify: Circuit breaker triggers sampling
    • Verify: Query returns 10k sampled neighbors (not 100M)
    • Verify: Query completes in <1 second

Performance Benchmarks

Critical Benchmarks:

  • Partition ID parse: <100 ns
  • Routing table lookup (cold): <1 μs
  • Routing table lookup (hot): <10 ns
  • Index lookup: <1 ms
  • Query planning: <10 ms
  • Authorization check (single vertex): <100 μs
  • Batch authorization (1M vertices): <1.1 ms
  • WAL write latency: <5 ms

Success Criteria

Phase 1 Success:

  • ✅ Route 1M queries to correct partition using hierarchical IDs
  • ✅ WAL write throughput: 10k ops/sec per partition
  • ✅ Partition registry supports 64,000 partitions

Phase 2 Success:

  • ✅ Load 1M vertex Parquet snapshot in <5 seconds (single proxy)
  • ✅ Tier fallback: hot (50 μs) → warm (100 μs) → cold (50 ms)
  • ✅ S3 cache hit rate: 72% (avoid S3 requests)

Phase 3 Success:

  • ✅ Execute Gremlin 2-hop traversal across 100 partitions in <1 second
  • ✅ Partition pruning: 99% reduction in queried partitions
  • ✅ Index lookup: <1 ms per property filter

Phase 4 Success:

  • ✅ Authorize 1M vertex query in <1.1 ms (batch authorization)
  • ✅ Audit log sampling: 1% base, 100% sensitive
  • ✅ Distributed traces for all queries (OpenTelemetry)

Phase 5 Success:

  • ✅ Super-node query (100M neighbors) completes in <1 second (sampled)
  • ✅ Partition migration: <10 seconds per partition (opaque IDs)
  • ✅ Zero downtime during partition rebalancing

Recommendations

Immediate Actions (Next 2 Weeks)

  1. Create Phase 1 Implementation Plan:

    • Break down Gap 1.1-1.4 into 2-week sprints
    • Assign developers to each gap
    • Set up dev environments with Kafka + etcd
  2. Prototype Critical Paths:

    • Prototype hierarchical partition ID parsing (1 day)
    • Prototype Kafka WAL producer (2 days)
    • Prototype etcd cluster membership (3 days)
  3. Set Up Testing Infrastructure:

    • Docker compose for Kafka + etcd + SQLite
    • Performance benchmark harness
    • Integration test framework

Medium-Term Actions (Weeks 3-8)

  1. Complete Phase 1:

    • Deliver all 4 gaps by end of Week 6
    • Milestone demo: Route 1M queries across 64,000 partitions
  2. Begin Phase 2 in Parallel:

    • Start snapshot loader prototyping (Week 5)
    • S3 backend client implementation (Week 6)

Long-Term Actions (Weeks 9-26)

  1. Incremental Delivery:

    • Phase 2: Complete by Week 12
    • Phase 3: Complete by Week 18
    • Phase 4: Complete by Week 22
    • Phase 5: Complete by Week 26
  2. Weekly Demos:

    • Every Friday: Demo incremental progress
    • Monthly milestones with stakeholder reviews

Conclusion

The prism-data-layer requires 16-20 weeks of foundational work before graph implementation can begin. This is not optional—attempting to implement graph features without these foundations will result in unbuildable systems and architectural debt.

The path forward:

  1. Accept the 16-20 week timeline for foundations
  2. Begin Phase 1 immediately (distributed systems foundations)
  3. Deliver incrementally with weekly demos
  4. Prioritize critical gaps (WAL, storage tiers, indexes)

Alternative (not recommended): Attempt graph implementation without foundations, resulting in:

  • Unbuildable prototypes (no WAL, no distributed coordination)
  • Performance failures (no indexes, no caching)
  • Cost explosions (no S3 optimization)
  • Security gaps (no authorization)

Recommended approach: Invest 16-20 weeks in solid foundations, then implement graph features on proven infrastructure.