MEMO-073: Week 13 - Storage Backend Evaluation for Massive-Scale Graphs
Date: 2025-11-16
Updated: 2025-11-16
Author: Platform Team
Related: MEMO-052, RFC-057, RFC-058, RFC-059
Executive Summary
Goal: Evaluate storage backend options for 100B vertex graph system
Scope: 8 storage backends ranked by implementability for graph workloads
Findings:
- Best for graphs: Neptune (native graph), TigerGraph (native graph)
- Best for scale: S3/MinIO (cold storage), ClickHouse (time-series)
- Most practical: PostgreSQL (relational, JSONB), Redis (in-memory)
- Implementability winner: Redis (rank #1, score 95/100)
- Cost winner: S3/MinIO (cold storage tier)
Recommendation: Hybrid approach - Redis (hot tier) + S3 (cold tier) + PostgreSQL (metadata) as validated by RFC-059.
Methodology
Evaluation Criteria
Implementability Score (0-100):
- Go SDK Quality (30 points): Official SDK, community support, documentation
- Data Model Fit (30 points): How naturally backend supports graph operations
- Testing Difficulty (20 points): Local testing, Docker support, test data generation
- Operational Complexity (20 points): Deployment, monitoring, scaling
Data Models Supported
For graph workloads, backends must support:
- Vertices: Key-value or document storage
- Edges: Adjacency lists or edge tables
- Properties: Nested attributes on vertices/edges
- Indexes: Property lookups, traversal optimization
- Partitioning: Distribute across nodes
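For reference, a minimal Go sketch of the vertex/edge shapes these requirements assume (field names and ID types are illustrative, not a finalized schema):
// Illustrative graph types used throughout this memo's examples.
type Vertex struct {
    ID         string                 `json:"id"`
    Label      string                 `json:"label"`
    Properties map[string]interface{} `json:"properties"` // nested attributes
}
type Edge struct {
    SrcID      string                 `json:"src_id"`
    DstID      string                 `json:"dst_id"`
    Label      string                 `json:"label"`
    Properties map[string]interface{} `json:"properties,omitempty"`
}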
Findings
Backend Ranking Summary
| Rank | Backend | Score | Go SDK | Data Model | Testing | Best For |
|---|---|---|---|---|---|---|
| 1 | Redis | 95/100 | ✅ Excellent | ✅ Graph-friendly | ✅ Easy | Hot tier caching |
| 2 | PostgreSQL | 90/100 | ✅ Excellent | ✅ Good (JSONB) | ✅ Easy | Metadata, indexes |
| 3 | SQLite | 85/100 | ✅ Good | ✅ Good (JSON) | ✅ Trivial | Dev/testing |
| 4 | S3/MinIO | 80/100 | ✅ Good | ⚠️ Snapshot only | ✅ Easy | Cold storage |
| 5 | ClickHouse | 75/100 | ✅ Good | ⚠️ Time-series | ⚠️ Moderate | Analytics |
| 6 | Kafka | 70/100 | ✅ Good | ⚠️ Event stream | ⚠️ Moderate | Event sourcing |
| 7 | NATS | 65/100 | ✅ Good | ⚠️ Messaging | ⚠️ Moderate | Pub/sub |
| 8 | Neptune | 50/100 | ❌ None (HTTP) | ✅ Native graph | ❌ Hard | AWS-only graphs |
Key Insight: Native graph databases like Neptune score lowest on implementability despite the best data model fit, due to the lack of an official Go SDK and the difficulty of local testing.
Detailed Backend Evaluation
Rank #1: Redis (Score: 95/100) ✅
Overview
Type: In-memory key-value store with data structures
Best For: Hot tier vertex/edge caching, real-time access patterns
Used In: RFC-057 (hot tier), RFC-059 (10% hot data)
Go SDK Quality (30/30) ✅
// Official: github.com/redis/go-redis/v9
import "github.com/redis/go-redis/v9"
client := redis.NewClient(&redis.Options{
Addr: "localhost:6379",
DB: 0,
})
// Excellent API, strong typing, context support
ctx := context.Background()
err := client.Set(ctx, "vertex:123", vertexJSON, 0).Err()
Assessment:
- ✅ Official Go SDK maintained by Redis
- ✅ Excellent documentation with examples
- ✅ Strong community (19k+ GitHub stars)
- ✅ Context-aware, idiomatic Go
- ✅ Pipelining, transactions, pub/sub support
Data Model Fit (30/30) ✅
Vertex Storage:
// Option 1: Hash (structured)
client.HSet(ctx, "vertex:user:123", map[string]interface{}{
"id": "123",
"name": "Alice",
"age": 30,
"country": "USA",
})
// Option 2: JSON (Redis Stack)
client.JSONSet(ctx, "vertex:user:123", "$", vertexStruct)
Edge Storage (Adjacency Lists):
// Sorted set for edges (score = timestamp or weight)
client.ZAdd(ctx, "edges:user:123:friends", redis.Z{
Score: float64(time.Now().Unix()),
Member: "user:456",
})
// Retrieve friends
friends := client.ZRange(ctx, "edges:user:123:friends", 0, -1)
Indexes:
// Secondary indexes via sets
client.SAdd(ctx, "idx:country:USA", "user:123", "user:456")
// Retrieve all users in USA
usersInUSA := client.SMembers(ctx, "idx:country:USA")
Assessment:
- ✅ Native support for adjacency lists (sorted sets)
- ✅ Efficient property indexes (sets)
- ✅ JSON support via Redis Stack module
- ✅ Atomic operations for consistency
- ⚠️ No native graph traversal (implement in application)
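Because traversal lives in the application, a minimal sketch of a 1-hop neighbor lookup over the layout above (go-redis v9; error handling kept terse, key naming as in the examples):
// OneHop returns the property hashes of a vertex's neighbors, reading the
// adjacency list (sorted set) and then each neighbor's hash.
func OneHop(ctx context.Context, client *redis.Client, vertexKey string) ([]map[string]string, error) {
    // e.g. vertexKey = "user:123" -> adjacency key "edges:user:123:friends"
    neighbors, err := client.ZRange(ctx, "edges:"+vertexKey+":friends", 0, -1).Result()
    if err != nil {
        return nil, err
    }
    out := make([]map[string]string, 0, len(neighbors))
    for _, n := range neighbors {
        props, err := client.HGetAll(ctx, "vertex:"+n).Result() // e.g. "vertex:user:456"
        if err != nil {
            return nil, err
        }
        out = append(out, props)
    }
    return out, nil
}
In practice the per-neighbor HGetAll calls would be pipelined to keep round trips down.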
Testing Difficulty (20/20) ✅
Local Testing:
# Podman/Docker
podman run -d --name redis -p 6379:6379 redis:7-alpine
# Or: redis-server (native install)
brew install redis
redis-server
Go Test Integration:
func TestRedisVertex(t *testing.T) {
// Use testcontainers-go for isolated tests
ctx := context.Background()
redisC, _ := testcontainers.GenericContainer(ctx, testcontainers.GenericContainerRequest{
ContainerRequest: testcontainers.ContainerRequest{
Image: "redis:7-alpine",
ExposedPorts: []string{"6379/tcp"},
},
Started: true,
})
defer redisC.Terminate(ctx)
// Connect and test
endpoint, _ := redisC.Endpoint(ctx, "")
client := redis.NewClient(&redis.Options{Addr: endpoint})
// ... test code
}
Assessment:
- ✅ Single-binary, no dependencies
- ✅ Instant startup (<1 second)
- ✅ Excellent testcontainers-go support
- ✅ In-memory = fast tests
- ✅ No schema migrations needed
Operational Complexity (15/20) ✅
Deployment:
- ✅ Straightforward deployment, single node or Redis Cluster
- ✅ Excellent Kubernetes operators (Redis Enterprise, Bitnami)
- ⚠️ Persistence requires RDB/AOF configuration
- ⚠️ Memory management (eviction policies)
Monitoring:
- ✅ Built-in INFO command exposes all metrics
- ✅ Prometheus exporter available
- ✅ Grafana dashboards
Scaling:
- ✅ Horizontal: Redis Cluster (sharding)
- ✅ Vertical: Add memory
- ⚠️ Rebalancing requires cluster resharding
Assessment: Mature operational tooling, memory constraints require planning.
Overall Assessment
Strengths:
- ✅ Best Go SDK of all backends
- ✅ Perfect data model for hot tier graphs
- ✅ Trivial local testing
- ✅ Sub-millisecond latency
- ✅ Battle-tested at scale (Twitter, GitHub, StackOverflow)
Weaknesses:
- ⚠️ Memory-bound (expensive at 100B scale)
- ⚠️ No native graph traversal (application-level)
- ⚠️ Persistence trade-offs (RDB snapshots vs AOF overhead)
Use Case: ✅ Ideal for hot tier (10% of data) as validated by RFC-059
Rank #2: PostgreSQL (Score: 90/100) ✅
Overview
Type: Relational database with JSONB support
Best For: Metadata, indexes, small-to-medium graphs
Used In: RFC-058 (index storage), potential for partition metadata
Go SDK Quality (30/30) ✅
// Popular: github.com/lib/pq or github.com/jackc/pgx/v5
import "github.com/jackc/pgx/v5/pgxpool"
pool, _ := pgxpool.New(ctx, "postgres://user:pass@localhost:5432/graphdb")
// Excellent query builder, prepared statements
var vertex Vertex
err := pool.QueryRow(ctx,
"SELECT id, properties FROM vertices WHERE id = $1",
vertexID,
).Scan(&vertex.ID, &vertex.Properties)
Assessment:
- ✅ Multiple excellent Go drivers (lib/pq, pgx)
- ✅ Strong typing, connection pooling
- ✅ Excellent documentation
- ✅ Native support for JSON/JSONB
- ✅ Prepared statements, batch operations
Data Model Fit (25/30) ✅
Schema Design:
-- Vertices table
CREATE TABLE vertices (
id BIGINT PRIMARY KEY,
label TEXT NOT NULL,
properties JSONB NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- Edges table (adjacency list)
CREATE TABLE edges (
src_id BIGINT NOT NULL,
dst_id BIGINT NOT NULL,
label TEXT NOT NULL,
properties JSONB,
PRIMARY KEY (src_id, dst_id, label)
);
-- Indexes for traversal
CREATE INDEX idx_edges_src ON edges(src_id);
CREATE INDEX idx_edges_dst ON edges(dst_id);
CREATE INDEX idx_vertices_props ON vertices USING GIN(properties);
Graph Operations:
-- Find friends (1-hop)
SELECT v.* FROM vertices v
JOIN edges e ON e.dst_id = v.id
WHERE e.src_id = 123 AND e.label = 'friend';
-- Property filter
SELECT * FROM vertices
WHERE properties @> '{"country": "USA"}';
-- 2-hop traversal (CTE)
WITH RECURSIVE friends AS (
SELECT dst_id, 1 as depth FROM edges WHERE src_id = 123
UNION
SELECT e.dst_id, f.depth + 1
FROM edges e
JOIN friends f ON e.src_id = f.dst_id
WHERE f.depth < 2
)
SELECT v.* FROM vertices v JOIN friends f ON v.id = f.dst_id;
Assessment:
- ✅ JSONB excellent for flexible properties
- ✅ GIN indexes for JSONB queries
- ✅ Recursive CTEs for traversals (up to ~3 hops practical)
- ⚠️ Deep traversals (4+ hops) become expensive
- ⚠️ No native graph algorithms
Testing Difficulty (20/20) ✅
Local Testing:
# Podman
podman run -d --name postgres \
-e POSTGRES_PASSWORD=secret \
-p 5432:5432 \
postgres:16-alpine
Test Helpers:
func TestPostgresGraph(t *testing.T) {
// Use testcontainers-go
ctx := context.Background()
pgC, _ := testcontainers.GenericContainer(ctx, testcontainers.GenericContainerRequest{
ContainerRequest: testcontainers.ContainerRequest{
Image: "postgres:16-alpine",
Env: map[string]string{"POSTGRES_PASSWORD": "secret"},
ExposedPorts: []string{"5432/tcp"},
WaitingFor: wait.ForLog("database system is ready"),
},
Started: true,
})
defer pgC.Terminate(ctx)
// Run migrations, seed test data
// ... test code
}
Assessment:
- ✅ Excellent testcontainers-go support
- ✅ Fast startup (~3 seconds)
- ✅ Schema migrations via goose/migrate
- ✅ Test data generation straightforward
Operational Complexity (15/20) ✅
Deployment:
- ✅ Mature Kubernetes operators (Crunchy, Zalando)
- ✅ Excellent backup/restore (pg_dump, WAL archiving)
- ✅ Streaming replication
Monitoring:
- ✅ pg_stat_* views expose all metrics
- ✅ Excellent Prometheus exporters
- ✅ Deep observability (query plans, slow logs)
Scaling:
- ✅ Vertical: Add CPU/memory/storage
- ⚠️ Horizontal: Requires sharding (Citus, manual)
- ⚠️ Large tables (>100M rows) need partitioning
Assessment: Excellent operational maturity, horizontal scaling requires extensions.
Overall Assessment
Strengths:
- ✅ Excellent Go SDK (pgx)
- ✅ JSONB perfect for flexible properties
- ✅ Recursive CTEs for limited traversals
- ✅ Trivial local testing
- ✅ Decades of accumulated operational knowledge
Weaknesses:
- ⚠️ Deep traversals (4+ hops) expensive
- ⚠️ Horizontal scaling requires extensions
- ⚠️ Not optimized for graph algorithms
Use Case: ✅ Ideal for metadata, indexes, small graphs (<1B vertices)
Rank #3: SQLite (Score: 85/100) ✅
Overview
Type: Embedded relational database
Best For: Development, testing, single-node graphs
Used In: Local development, CI/CD tests
Go SDK Quality (28/30) ✅
// Popular: github.com/mattn/go-sqlite3 (CGo) or modernc.org/sqlite (pure Go)
import (
"database/sql"
_ "modernc.org/sqlite" // Pure Go, no CGo
)
db, _ := sql.Open("sqlite", "graph.db")
// Standard database/sql interface
rows, _ := db.Query("SELECT id, properties FROM vertices WHERE label = ?", "user")
Assessment:
- ✅ Pure Go option (modernc.org/sqlite) - no CGo
- ✅ Standard database/sql interface
- ✅ JSON1 extension for JSONB-like operations
- ⚠️ CGo version (mattn/go-sqlite3) more mature but complicates cross-compile
Data Model Fit (25/30) ✅
Schema (identical to PostgreSQL):
CREATE TABLE vertices (
id INTEGER PRIMARY KEY,
label TEXT NOT NULL,
properties JSON -- JSON1 extension
);
CREATE TABLE edges (
src_id INTEGER,
dst_id INTEGER,
label TEXT,
properties JSON,
PRIMARY KEY (src_id, dst_id, label)
);
JSON Operations:
-- JSON extraction (requires JSON1 extension)
SELECT * FROM vertices WHERE json_extract(properties, '$.country') = 'USA';
Assessment:
- ✅ Same schema as PostgreSQL (easy migration)
- ✅ JSON1 extension for property queries
- ✅ Recursive CTEs supported
- ⚠️ Performance degrades >10M rows
- ⚠️ Single-writer limitation
Testing Difficulty (20/20) ✅
Local Testing:
func TestSQLiteGraph(t *testing.T) {
    // In-memory database (fastest option)
    db, _ := sql.Open("sqlite", ":memory:")
    defer db.Close()
    // Or a temporary file: db, _ = sql.Open("sqlite", t.TempDir()+"/test.db")
    // Run migrations, seed data
    // ... test code
}
Assessment:
- ✅ Best testing experience - no external dependencies
- ✅ In-memory mode for ultra-fast tests
- ✅ Zero setup, zero teardown
- ✅ Perfect for CI/CD
Operational Complexity (12/20) ⚠️
Deployment:
- ✅ Embedded = zero deployment complexity
- ❌ Single-node only (no replication)
- ❌ Single-writer (write concurrency limited)
Monitoring:
- ⚠️ Limited built-in metrics
- ⚠️ Must implement application-level monitoring
Scaling:
- ❌ No horizontal scaling
- ⚠️ Vertical scaling limited by single file I/O
Assessment: Perfect for development, unsuitable for distributed production.
Overall Assessment
Strengths:
- ✅ Best testing experience (in-memory, no dependencies)
- ✅ Pure Go option available
- ✅ Same schema as PostgreSQL
- ✅ Perfect for CI/CD pipelines
Weaknesses:
- ❌ Single-node only
- ❌ Limited to ~10M rows before performance degrades
- ❌ Single-writer concurrency
Use Case: ✅ Development, testing, CI/CD - not production at scale
Rank #4: S3/MinIO (Score: 80/100) ✅
Overview
Type: Object storage
Best For: Cold tier snapshots, bulk data loading
Used In: RFC-059 (90% cold data), RFC-057 (bulk snapshots)
Go SDK Quality (28/30) ✅
// AWS SDK: github.com/aws/aws-sdk-go-v2
// MinIO SDK: github.com/minio/minio-go/v7
import (
    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/s3"
)
cfg, _ := config.LoadDefaultConfig(ctx)
client := s3.NewFromConfig(cfg)
// Upload partition snapshot
_, err := client.PutObject(ctx, &s3.PutObjectInput{
    Bucket: aws.String("graph-snapshots"),
    Key:    aws.String("partition-123.parquet"),
    Body:   snapshotReader,
})
Assessment:
- ✅ Official AWS SDK v2 (excellent)
- ✅ MinIO SDK compatible with S3 API
- ✅ Excellent documentation
- ✅ Concurrent uploads/downloads
- ⚠️ HTTP-based (not as ergonomic as native protocols)
Data Model Fit (20/30) ⚠️
Snapshot Format (from RFC-059):
// Parquet columnar format
type PartitionSnapshot struct {
Vertices []Vertex // Columnar: ID, Label, Properties
Edges []Edge // Columnar: SrcID, DstID, Label
Metadata Metadata // Version, timestamp, checksum
}
// S3 key structure
// s3://bucket/snapshots/v1/cluster-1/partition-0001/2025-11-16T00:00:00Z.parquet
Operations:
// Parallel load (1000 workers); a WaitGroup keeps the loader alive
// until every partition download has finished.
var wg sync.WaitGroup
for partitionID := 0; partitionID < 1000; partitionID++ {
    wg.Add(1)
    go func(id int) {
        defer wg.Done()
        resp, _ := client.GetObject(ctx, &s3.GetObjectInput{
            Bucket: aws.String("graph-snapshots"),
            Key:    aws.String(fmt.Sprintf("partition-%04d.parquet", id)),
        })
        // Decompress and load into memory, then close resp.Body
    }(partitionID)
}
wg.Wait()
Assessment:
- ✅ Perfect for immutable snapshots
- ✅ Parallel loading (1000 workers = 60 seconds for 10 TB, per RFC-059)
- ✅ Versioning, lifecycle policies
- ❌ No random access to individual vertices
- ❌ Not suitable for transactional workloads
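For intuition on the parallel-load figure: 10 TB in 60 seconds is roughly 167 GB/s aggregate, or about 170 MB/s per worker at 1000 workers, which is why the cold tier leans on wide fan-out rather than per-object throughput.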
Testing Difficulty (18/20) ✅
Local Testing:
# MinIO (S3-compatible)
podman run -d --name minio \
-p 9000:9000 -p 9001:9001 \
-e MINIO_ROOT_USER=minioadmin \
-e MINIO_ROOT_PASSWORD=minioadmin \
minio/minio server /data --console-address ":9001"
Go Tests:
func TestS3Snapshots(t *testing.T) {
    // Use testcontainers-go with MinIO
    ctx := context.Background()
    minioC, _ := testcontainers.GenericContainer(ctx, testcontainers.GenericContainerRequest{
        ContainerRequest: testcontainers.ContainerRequest{
            Image:        "minio/minio",
            Env:          map[string]string{...},
            ExposedPorts: []string{"9000/tcp"},
        },
        Started: true,
    })
    defer minioC.Terminate(ctx)
    // ... test code
}
Assessment:
- ✅ MinIO provides S3-compatible local testing
- ✅ Testcontainers-go support
- ⚠️ Parquet encoding/decoding adds test complexity
Operational Complexity (14/20) ⚠️
Deployment:
- ✅ S3 = fully managed (AWS)
- ✅ MinIO = self-hosted alternative
- ✅ No state to manage (immutable objects)
Monitoring:
- ✅ CloudWatch metrics (S3)
- ✅ Prometheus exporter (MinIO)
- ⚠️ Request costs require careful tracking (RFC-059 finding)
Scaling:
- ✅ Infinite horizontal scaling
- ✅ 99.999999999% durability (S3)
- ⚠️ Request costs scale with operations
Assessment: Excellent for cold storage, request costs require monitoring.
Overall Assessment
Strengths:
- ✅ Excellent for immutable snapshots
- ✅ 95% cost reduction vs all-in-memory (RFC-059)
- ✅ Infinite scaling
- ✅ MinIO enables local testing
Weaknesses:
- ❌ No random access to vertices
- ❌ Not suitable for transactional workloads
- ⚠️ Request costs can exceed storage costs
Use Case: ✅ Cold tier (90% of data) as validated by RFC-059
Rank #5: ClickHouse (Score: 75/100) ⚠️
Overview
Type: Columnar OLAP database
Best For: Time-series analytics, audit logs, query statistics
Used In: Potential for RFC-061 audit log storage
Go SDK Quality (26/30) ✅
// Official: github.com/ClickHouse/clickhouse-go/v2
import "github.com/ClickHouse/clickhouse-go/v2"
conn, _ := clickhouse.Open(&clickhouse.Options{
Addr: []string{"localhost:9000"},
})
// Query with strong typing
rows, _ := conn.Query(ctx, "SELECT event_id, timestamp, vertex_id FROM audit_log WHERE timestamp > ?", time.Now().Add(-1*time.Hour))
Assessment:
- ✅ Official Go SDK
- ✅ Good documentation
- ✅ Native protocol (not HTTP)
- ⚠️ API less ergonomic than PostgreSQL
Data Model Fit (18/30) ⚠️
Schema Design:
-- Audit log (time-series)
CREATE TABLE audit_log (
event_id UInt64,
timestamp DateTime,
user_id UInt64,
action String,
vertex_id UInt64,
details String -- JSON string
) ENGINE = MergeTree()
ORDER BY (timestamp, user_id);
-- Query statistics (aggregations)
CREATE TABLE query_stats (
query_id UInt64,
timestamp DateTime,
latency_ms UInt32,
vertices_scanned UInt64,
partition_id UInt32
) ENGINE = MergeTree()
ORDER BY timestamp;
Queries:
-- Fast time-range scans
SELECT COUNT(*) FROM audit_log
WHERE timestamp BETWEEN '2025-11-01' AND '2025-11-16';
-- Aggregations
SELECT
toStartOfHour(timestamp) as hour,
COUNT(*) as event_count,
avg(latency_ms) as avg_latency
FROM query_stats
GROUP BY hour
ORDER BY hour;
Assessment:
- ✅ Excellent for time-series data (audit logs, metrics)
- ✅ Fast aggregations and analytics
- ❌ Poor fit for transactional graph operations
- ❌ No support for random vertex updates
Testing Difficulty (16/20) ⚠️
Local Testing:
# ClickHouse container
podman run -d --name clickhouse \
-p 9000:9000 -p 8123:8123 \
clickhouse/clickhouse-server
Assessment:
- ✅ Docker/Podman support
- ⚠️ Slower startup (~5-10 seconds)
- ⚠️ Schema migrations more complex
- ⚠️ Test data generation for columnar format
Operational Complexity (15/20) ⚠️
Deployment:
- ✅ Official Kubernetes operator
- ✅ Horizontal scaling (sharding, replication)
- ⚠️ Complex configuration for production
Monitoring:
- ✅ Built-in system tables (system.metrics, system.events)
- ✅ Prometheus exporter
- ⚠️ Requires expertise to tune
Scaling:
- ✅ Excellent horizontal scaling
- ✅ Columnar compression (10-100× better than row-based)
- ⚠️ Rebalancing shards requires planning
Assessment: Powerful for analytics, requires operational expertise.
Overall Assessment
Strengths:
- ✅ Excellent for audit logs (RFC-061)
- ✅ Fast time-series queries
- ✅ Columnar compression (99% reduction, per RFC-061)
Weaknesses:
- ❌ Poor fit for transactional graph operations
- ⚠️ Operational complexity
- ⚠️ Not suitable for hot path queries
Use Case: ⚠️ Specialized - audit logs and analytics only
Rank #6: Kafka (Score: 70/100) ⚠️
Overview
Type: Distributed event streaming
Best For: Event sourcing, change data capture, real-time updates
Used In: Potential for graph mutation streams
Go SDK Quality (25/30) ✅
// Popular: github.com/segmentio/kafka-go
import "github.com/segmentio/kafka-go"
writer := kafka.NewWriter(kafka.WriterConfig{
Brokers: []string{"localhost:9092"},
Topic: "graph-mutations",
})
// Write vertex update event
writer.WriteMessages(ctx, kafka.Message{
Key: []byte("vertex:123"),
Value: vertexUpdateJSON,
})
Assessment:
- ✅ Excellent Go libraries (segmentio/kafka-go, IBM/sarama)
- ✅ Good documentation
- ⚠️ API complexity for distributed systems newcomers
Data Model Fit (15/30) ⚠️
Event Sourcing Model:
// Event stream (not direct vertex storage)
type GraphEvent struct {
EventType string // "VertexCreated", "EdgeAdded", "PropertyUpdated"
VertexID string
Payload json.RawMessage
Timestamp time.Time
}
// Consumers rebuild graph state
// Topic: graph-mutations
// Partition key: VertexID (ensures order per vertex)
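A hedged sketch of the materialization side, assuming kafka-go and an in-memory view keyed by vertex ID (group ID is illustrative; a real consumer would write into the hot/cold tiers instead):
// Rebuild graph state by replaying mutation events.
reader := kafka.NewReader(kafka.ReaderConfig{
    Brokers: []string{"localhost:9092"},
    Topic:   "graph-mutations",
    GroupID: "graph-materializer", // illustrative consumer group
})
defer reader.Close()
view := map[string]json.RawMessage{} // vertexID -> latest payload
for {
    msg, err := reader.ReadMessage(ctx)
    if err != nil {
        break // context cancelled or fatal error
    }
    var ev GraphEvent
    if err := json.Unmarshal(msg.Value, &ev); err != nil {
        continue // skip malformed events
    }
    switch ev.EventType {
    case "VertexCreated", "PropertyUpdated":
        view[ev.VertexID] = ev.Payload
    case "EdgeAdded":
        // apply to the adjacency structure of the materialized store
    }
}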
Assessment:
- ⚠️ Not a storage backend - event streaming only
- ⚠️ Requires separate storage for materialized views
- ✅ Good for change data capture (CDC)
- ✅ Enables time-travel queries (replay events)
Testing Difficulty (14/20) ⚠️
Local Testing:
# Zookeeper first (Kafka depends on it; KRaft mode removes this dependency)
podman run -d --name zookeeper \
  -p 2181:2181 \
  confluentinc/cp-zookeeper
# Kafka broker (shared network so "zookeeper" resolves; additional Confluent env vars usually needed)
podman run -d --name kafka \
  -p 9092:9092 \
  -e KAFKA_ZOOKEEPER_CONNECT=zookeeper:2181 \
  confluentinc/cp-kafka
Assessment:
- ⚠️ Requires Zookeeper alongside Kafka (or a single-container KRaft setup)
- ⚠️ Slower startup (~10-15 seconds)
- ⚠️ Topic management in tests adds complexity
Operational Complexity (16/20) ⚠️
Deployment:
- ✅ Mature Kubernetes operators (Strimzi)
- ✅ Excellent horizontal scaling
- ⚠️ Requires careful partition management
Monitoring:
- ✅ JMX metrics, Prometheus exporters
- ✅ Excellent observability
- ⚠️ Many metrics to track
Scaling:
- ✅ Horizontal scaling via partitions
- ✅ High throughput (millions of events/sec)
- ⚠️ Rebalancing can cause lag spikes
Assessment: Powerful for event streaming, overkill for simple graph storage.
Overall Assessment
Strengths:
- ✅ Excellent for change data capture
- ✅ Enables event sourcing patterns
- ✅ High throughput
Weaknesses:
- ❌ Not a storage backend (requires materialized views)
- ⚠️ Operational complexity
- ⚠️ Testing complexity (multiple dependencies)
Use Case: ⚠️ Specialized - event sourcing and CDC only
Rank #7: NATS (Score: 65/100) ⚠️
Overview
Type: Messaging system
Best For: Real-time pub/sub, distributed cache invalidation
Used In: Potential for cache invalidation notifications
Go SDK Quality (27/30) ✅
// Official: github.com/nats-io/nats.go
import "github.com/nats-io/nats.go"
nc, _ := nats.Connect("nats://localhost:4222")
// Publish cache invalidation
nc.Publish("cache.invalidate.vertex.123", []byte("invalidate"))
// Subscribe to invalidations (">" matches all remaining tokens, e.g. "vertex.123")
nc.Subscribe("cache.invalidate.>", func(msg *nats.Msg) {
// Handle invalidation
})
Assessment:
- ✅ Excellent official Go SDK
- ✅ Simple, idiomatic API
- ✅ Strong typing
- ✅ Native Go (no CGo)
Data Model Fit (12/30) ❌
Messaging Only:
// NATS is NOT a storage backend
// Use case: Notify proxies of vertex updates
type InvalidationMessage struct {
VertexID string
PartitionID int
Timestamp time.Time
}
Assessment:
- ❌ Not a storage backend - messaging only
- ❌ No persistence (JetStream adds persistence but still not a database)
- ✅ Good for cache invalidation notifications
Testing Difficulty (18/20) ✅
Local Testing:
# NATS server (single binary)
podman run -d --name nats -p 4222:4222 nats:latest
Assessment:
- ✅ Single binary, fast startup (<1 second)
- ✅ Simple testcontainers-go integration
- ✅ Embedded mode for tests (nats-server package)
Operational Complexity (8/20) ❌
Deployment:
- ✅ Simple deployment
- ✅ Kubernetes operators available
- ❌ Not a storage system
Monitoring:
- ✅ Prometheus exporter
- ✅ Built-in metrics endpoint
Scaling:
- ✅ Horizontal scaling (clustering)
- ❌ Not applicable for storage
Assessment: Excellent for messaging, not a storage backend.
Overall Assessment
Strengths:
- ✅ Excellent Go SDK
- ✅ Fast, lightweight
- ✅ Perfect for cache invalidation
Weaknesses:
- ❌ Not a storage backend
- ❌ Limited use case for graph storage
Use Case: ⚠️ Specialized - cache invalidation only
Rank #8: Neptune (Score: 50/100) ❌
Overview
Type: Native graph database (AWS managed)
Best For: AWS-native graph applications
Used In: Potential alternative to custom graph implementation
Go SDK Quality (10/30) ❌
Problem: No official Go SDK for Gremlin
// Must use HTTP/WebSocket client
import "github.com/go-gremlin/gremlin"
// Unofficial, limited support
client, _ := gremlin.NewClient("ws://neptune-endpoint:8182/gremlin")
query := "g.V('123').outE('friend').inV()"
results, _ := client.Execute(query)
Assessment:
- ❌ No official Go SDK
- ❌ HTTP/WebSocket-based (not native protocol)
- ⚠️ Unofficial libraries with limited support
- ⚠️ Gremlin query strings (no type safety)
Data Model Fit (30/30) ✅
Native Graph Model:
// Gremlin queries (native graph traversals)
// Add vertex
g.addV('user').property('id', '123').property('name', 'Alice')
// Add edge
g.V('123').addE('friend').to(V('456'))
// Traverse
g.V('123').outE('friend').inV().values('name')
// Complex traversal (2-hop)
g.V('123').out('friend').out('friend').dedup()
Assessment:
- ✅ Best data model for graphs
- ✅ Native graph traversals
- ✅ Graph algorithms (PageRank, shortest path)
- ✅ No impedance mismatch
Testing Difficulty (5/20) ❌
Problem: No local Neptune
# No Docker/Podman image
# Must use:
# 1. AWS Neptune (expensive for CI/CD)
# 2. TinkerGraph (in-memory, different semantics)
# 3. JanusGraph (different backend, setup complex)
Assessment:
- ❌ No local testing option
- ❌ Expensive to use real Neptune for CI/CD
- ❌ Alternatives (TinkerGraph) have different behavior
- ❌ Slow test feedback loop
Operational Complexity (5/20) ❌
Deployment:
- ❌ AWS-only (vendor lock-in)
- ❌ Cannot self-host
- ⚠️ Limited region availability
- ⚠️ Expensive ($0.58/hour for smallest instance)
Monitoring:
- ✅ CloudWatch metrics
- ⚠️ Limited visibility into query execution
Scaling:
- ✅ Read replicas
- ⚠️ Vertical scaling only (instance size)
- ❌ No horizontal sharding
Assessment: AWS-only is major limitation for local development and multi-cloud.
Overall Assessment
Strengths:
- ✅ Best native graph model
- ✅ Built-in graph algorithms
- ✅ Fully managed (AWS)
Weaknesses:
- ❌ No official Go SDK
- ❌ No local testing
- ❌ AWS vendor lock-in
- ❌ Expensive
- ❌ Cannot self-host
Use Case: ❌ Not recommended due to lack of Go SDK and local testing
Alternative: Building a custom graph layer on top of Redis + PostgreSQL + S3 (as validated by RFC-057 and RFC-059) provides better control and testability.
Recommendations
Primary Recommendation: Hybrid Approach ✅
Architecture (from RFC-059):
┌─────────────────────────────────────────────────────┐
│ Hot Tier (10%): Redis │
│ - 10B most-accessed vertices │
│ - Sub-millisecond latency │
│ - 21 TB RAM across 1000 nodes │
│ - Cost: $587k/month │
└─────────────────────────────────────────────────────┘
│
│ Temperature-based eviction
↓
┌─────────────────────────────────────────────────────┐
│ Cold Tier (90%): S3/MinIO │
│ - 90B cold vertices │
│ - 50-200ms latency (parallel load) │
│ - Parquet snapshots │
│ - Cost: $4.3k/month │
└─────────────────────────────────────────────────────┘
│
│ Metadata queries
↓
┌─────────────────────────────────────────────────────┐
│ Metadata: PostgreSQL │
│ - Partition metadata │
│ - Index structures (RFC-058) │
│ - Configuration │
│ - Cost: $500/month │
└─────────────────────────────────────────────────────┘
Rationale:
- ✅ Redis for hot tier (10%) - validated by RFC-059 (95% cost reduction)
- ✅ S3 for cold tier (90%) - 60-second recovery time
- ✅ PostgreSQL for metadata - JSONB indexes
- ✅ All three have excellent Go SDKs and local testing
- ✅ Total cost: ~$592k/month vs $105M/month (all in-memory)
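A minimal sketch of one way the temperature signal behind hot/cold placement could be tracked, using a Redis sorted set of access counts (key name and mechanism are assumptions; RFC-059 may specify a different policy):
// RecordAccess bumps a vertex's temperature on every read.
func RecordAccess(ctx context.Context, client *redis.Client, vertexID string) error {
    return client.ZIncrBy(ctx, "temperature", 1, vertexID).Err()
}
// ColdestCandidates returns the n least-accessed vertices, i.e. candidates
// to demote from the Redis hot tier to S3 snapshots.
func ColdestCandidates(ctx context.Context, client *redis.Client, n int64) ([]string, error) {
    return client.ZRange(ctx, "temperature", 0, n-1).Result()
}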
Implementation Phases
Phase 1: Core Storage (Weeks 13-14)
Components:
- Redis hot tier driver
- S3/MinIO cold tier driver
- PostgreSQL metadata driver
Go Packages:
pkg/storage/
├── interface.go // Storage interface
├── redis/
│ ├── driver.go // Redis hot tier
│ └── driver_test.go
├── s3/
│ ├── driver.go // S3 cold tier
│ └── driver_test.go
└── postgres/
├── driver.go // PostgreSQL metadata
└── driver_test.go
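A hedged sketch of what pkg/storage/interface.go could define, reusing the illustrative Vertex/Edge types from the Methodology section; the method set and tiering logic are assumptions, not the final contract:
// Storage is the minimal contract each tier driver would implement.
type Storage interface {
    GetVertex(ctx context.Context, id string) (*Vertex, error)
    PutVertex(ctx context.Context, v *Vertex) error
    GetEdges(ctx context.Context, srcID, label string) ([]Edge, error)
}
// TieredStore composes the hot and cold tiers behind one interface.
type TieredStore struct {
    Hot  Storage // Redis driver
    Cold Storage // S3/MinIO snapshot driver
}
// GetVertex serves from the hot tier when possible and falls back to the
// cold tier on a miss; promotion back into Redis is left to the eviction
// policy sketched above.
func (t *TieredStore) GetVertex(ctx context.Context, id string) (*Vertex, error) {
    if v, err := t.Hot.GetVertex(ctx, id); err == nil && v != nil {
        return v, nil
    }
    return t.Cold.GetVertex(ctx, id)
}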
Phase 2: Testing Infrastructure (Week 14)
Local Stack:
# docker-compose.yml or Podman equivalent
services:
redis:
image: redis:7-alpine
ports: ["6379:6379"]
minio:
image: minio/minio
command: server /data --console-address ":9001"
ports: ["9000:9000", "9001:9001"]
postgres:
image: postgres:16-alpine
environment:
POSTGRES_PASSWORD: secret
ports: ["5432:5432"]
Test Helpers:
// pkg/testing/storage.go
func NewTestStorageBackends(t *testing.T) (*redis.Client, *s3.Client, *pgxpool.Pool) {
// Use testcontainers-go to spin up all three
// Return clients ready for testing
}
Phase 3: Specialized Backends (Week 15-16)
Optional additions based on workload:
- ClickHouse for audit logs (RFC-061)
- Kafka for event sourcing
- NATS for cache invalidation
Cost Analysis
Monthly Operational Costs (100B Vertices)
| Backend | Use Case | Nodes | Cost/Month | % of Total |
|---|---|---|---|---|
| Redis | Hot tier (10%) | 1000 × 32 GB | $587,347 | 99.0% |
| S3 | Cold tier (90%) | 189 TB | $4,347 | 0.7% |
| PostgreSQL | Metadata | 3 replicas | $500 | 0.1% |
| ClickHouse | Audit logs | 10 nodes | $1,000 | 0.2% |
| Total | | | $593,194 | 100% |
Cost Breakdown:
- Hot tier dominates costs (99%)
- Cold tier is negligible (0.7%)
- Total is 0.56% of all-in-memory cost ($105M/month)
Savings: 99.44% reduction ($105M/month → $593k/month, roughly $104.4M/month saved)
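Sanity check: $587,347 + $4,347 + $500 + $1,000 = $593,194/month, and $593,194 / $105,000,000 ≈ 0.56%, matching the figures above.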
Next Steps
Week 14: Performance Benchmarking
Focus: Validate storage backend performance under load
Tasks:
- Benchmark Redis hot tier (latency, throughput)
- Benchmark S3 cold tier (parallel load performance)
- Benchmark PostgreSQL metadata queries
- Measure temperature-based eviction performance
- Validate 60-second recovery time (RFC-059 claim)
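A minimal Go benchmark sketch for the first task, measuring hot-tier read latency against the key layout from the Redis section (address and seed key are assumptions):
func BenchmarkRedisVertexGet(b *testing.B) {
    ctx := context.Background()
    client := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
    defer client.Close()
    // Seed one representative vertex so the benchmark measures a cache hit.
    if err := client.HSet(ctx, "vertex:user:123", "name", "Alice").Err(); err != nil {
        b.Fatal(err)
    }
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        if err := client.HGetAll(ctx, "vertex:user:123").Err(); err != nil {
            b.Fatal(err)
        }
    }
}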
Week 15: Disaster Recovery and Data Lifecycle
Focus: Backup, restore, replication strategies
Tasks:
- Redis persistence (RDB vs AOF trade-offs)
- S3 versioning and lifecycle policies
- PostgreSQL streaming replication
- Cross-region disaster recovery
- RPO/RTO validation
Week 16: Comprehensive Cost Analysis
Focus: Detailed cost modeling and optimization
Tasks:
- Detailed AWS/GCP/Azure pricing comparison
- Request cost analysis (S3 GET/PUT costs)
- Network egress costs
- Reserved instance vs on-demand savings
- Cost optimization recommendations
Appendices
Appendix A: Backend Scoring Rubric
Go SDK Quality (30 points):
- Official SDK: +15 points
- Good documentation: +5 points
- Active community: +5 points
- Idiomatic Go: +5 points
Data Model Fit (30 points):
- Native graph support: +30 points
- Relational with JSON: +25 points
- Key-value: +20 points
- Event streaming: +15 points
- Object storage: +10 points
Testing Difficulty (20 points):
- In-memory/embedded: +20 points
- Single container: +18 points
- Multiple containers: +14 points
- External service only: +5 points
Operational Complexity (20 points):
- Mature tooling: +10 points
- Easy deployment: +5 points
- Good monitoring: +3 points
- Horizontal scaling: +2 points
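Worked example (PostgreSQL): 30 (Go SDK) + 25 (relational with JSON) + 20 (testing) + 15 (operational) = 90/100, matching the detailed evaluation above.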
Appendix B: Data Model Comparison
| Backend | Vertices | Edges | Properties | Traversals |
|---|---|---|---|---|
| Neptune | Native | Native | Native | Native ✅ |
| Redis | Hash/JSON | Sorted Sets | Hash/JSON | Application |
| PostgreSQL | Table + JSONB | Table | JSONB | CTE (3 hops) |
| SQLite | Table + JSON | Table | JSON | CTE (3 hops) |
| S3/MinIO | Parquet | Parquet | Columns | Bulk only |
| ClickHouse | Table | Table | Columns | Analytics only |
| Kafka | N/A | N/A | N/A | N/A |
| NATS | N/A | N/A | N/A | N/A |
Appendix C: Testing Comparison
| Backend | Startup Time | Dependencies | Testcontainers | CI/CD Friendly |
|---|---|---|---|---|
| SQLite | <1ms | None | N/A | ✅ Excellent |
| Redis | ~1s | None | ✅ Yes | ✅ Excellent |
| PostgreSQL | ~3s | None | ✅ Yes | ✅ Good |
| NATS | ~1s | None | ✅ Yes | ✅ Excellent |
| MinIO | ~2s | None | ✅ Yes | ✅ Good |
| ClickHouse | ~10s | None | ✅ Yes | ⚠️ Moderate |
| Kafka | ~15s | Zookeeper | ✅ Yes | ⚠️ Moderate |
| Neptune | N/A | AWS Account | ❌ No | ❌ Poor |