
MEMO-073: Week 13 - Storage Backend Evaluation for Massive-Scale Graphs

Date: 2025-11-16
Updated: 2025-11-16
Author: Platform Team
Related: MEMO-052, RFC-057, RFC-058, RFC-059

Executive Summary

Goal: Evaluate storage backend options for 100B vertex graph system

Scope: 8 storage backends ranked by implementability for graph workloads

Findings:

  • Best for graphs: Neptune (native graph), TigerGraph (native graph)
  • Best for scale: S3/MinIO (cold storage), ClickHouse (time-series)
  • Most practical: PostgreSQL with JSONB (relational), Redis (in-memory)
  • Implementability winner: Redis (rank #1, score 95/100)
  • Cost winner: S3/MinIO (cold storage tier)

Recommendation: Hybrid approach - Redis (hot tier) + S3 (cold tier) + PostgreSQL (metadata) as validated by RFC-059.


Methodology

Evaluation Criteria

Implementability Score (0-100):

  1. Go SDK Quality (30 points): Official SDK, community support, documentation
  2. Data Model Fit (30 points): How naturally backend supports graph operations
  3. Testing Difficulty (20 points): Local testing, Docker support, test data generation
  4. Operational Complexity (20 points): Deployment, monitoring, scaling

Data Models Supported

For graph workloads, backends must support:

  • Vertices: Key-value or document storage
  • Edges: Adjacency lists or edge tables
  • Properties: Nested attributes on vertices/edges
  • Indexes: Property lookups, traversal optimization
  • Partitioning: Distribute across nodes

Findings

Backend Ranking Summary

| Rank | Backend    | Score  | Go SDK         | Data Model        | Testing     | Best For          |
|------|------------|--------|----------------|-------------------|-------------|-------------------|
| 1    | Redis      | 95/100 | ✅ Excellent   | ✅ Graph-friendly | ✅ Easy     | Hot tier caching  |
| 2    | PostgreSQL | 90/100 | ✅ Excellent   | ✅ Good (JSONB)   | ✅ Easy     | Metadata, indexes |
| 3    | SQLite     | 85/100 | ✅ Good        | ✅ Good (JSON)    | ✅ Trivial  | Dev/testing       |
| 4    | S3/MinIO   | 80/100 | ✅ Good        | ⚠️ Snapshot only  | ✅ Easy     | Cold storage      |
| 5    | ClickHouse | 75/100 | ✅ Good        | ⚠️ Time-series    | ⚠️ Moderate | Analytics         |
| 6    | Kafka      | 70/100 | ✅ Good        | ⚠️ Event stream   | ⚠️ Moderate | Event sourcing    |
| 7    | NATS       | 65/100 | ✅ Good        | ⚠️ Messaging      | ⚠️ Moderate | Pub/sub           |
| 8    | Neptune    | 50/100 | ❌ None (HTTP) | ✅ Native graph   | ❌ Hard     | AWS-only graphs   |

Key Insight: Native graph databases (Neptune, TigerGraph) score lowest on implementability despite best data model fit, due to lack of Go SDK and testing complexity.


Detailed Backend Evaluation

Rank #1: Redis (Score: 95/100) ✅

Overview

Type: In-memory key-value store with data structures
Best For: Hot tier vertex/edge caching, real-time access patterns
Used In: RFC-057 (hot tier), RFC-059 (10% hot data)

Go SDK Quality (30/30) ✅

// Official: github.com/redis/go-redis/v9
import "github.com/redis/go-redis/v9"

client := redis.NewClient(&redis.Options{
    Addr: "localhost:6379",
    DB:   0,
})

// Excellent API, strong typing, context support
ctx := context.Background()
err := client.Set(ctx, "vertex:123", vertexJSON, 0).Err()

Assessment:

  • ✅ Official Go SDK maintained by Redis
  • ✅ Excellent documentation with examples
  • ✅ Strong community (19k+ GitHub stars)
  • ✅ Context-aware, idiomatic Go
  • ✅ Pipelining, transactions, pub/sub support
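
For bulk writes, pipelining batches many commands into a single round trip. A minimal sketch, assuming a hypothetical `vertices` map (map[string]map[string]interface{}) as input:

// Sketch: pipeline vertex writes to amortize network round trips.
pipe := client.Pipeline()
for id, props := range vertices { // hypothetical input
    pipe.HSet(ctx, "vertex:user:"+id, props)
}
_, err := pipe.Exec(ctx) // all queued commands go in one round trip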

Data Model Fit (30/30) ✅

Vertex Storage:

// Option 1: Hash (structured)
client.HSet(ctx, "vertex:user:123", map[string]interface{}{
    "id":      "123",
    "name":    "Alice",
    "age":     30,
    "country": "USA",
})

// Option 2: JSON (Redis Stack)
client.JSONSet(ctx, "vertex:user:123", "$", vertexStruct)

Edge Storage (Adjacency Lists):

// Sorted set for edges (score = timestamp or weight)
client.ZAdd(ctx, "edges:user:123:friends", redis.Z{
Score: float64(time.Now().Unix()),
Member: "user:456",
})

// Retrieve friends
friends := client.ZRange(ctx, "edges:user:123:friends", 0, -1)

Indexes:

// Secondary indexes via sets
client.SAdd(ctx, "idx:country:USA", "user:123", "user:456")

// Retrieve all users in USA
usersInUSA := client.SMembers(ctx, "idx:country:USA").Val()

Assessment:

  • ✅ Native support for adjacency lists (sorted sets)
  • ✅ Efficient property indexes (sets)
  • ✅ JSON support via Redis Stack module
  • ✅ Atomic operations for consistency
  • ⚠️ No native graph traversal (implement in application)
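
Since traversal lives in the application, a minimal sketch of a bounded breadth-first expansion over the adjacency-list layout above (illustrative only; a production version would pipeline the per-hop ZRange calls):

// Sketch: 2-hop traversal over "edges:<vertex>:friends" sorted sets.
func twoHopFriends(ctx context.Context, client *redis.Client, start string) ([]string, error) {
    seen := map[string]bool{start: true}
    frontier := []string{start}
    var result []string

    for hop := 0; hop < 2; hop++ {
        var next []string
        for _, v := range frontier {
            neighbors, err := client.ZRange(ctx, "edges:"+v+":friends", 0, -1).Result()
            if err != nil {
                return nil, err
            }
            for _, n := range neighbors {
                if !seen[n] { // dedup across hops
                    seen[n] = true
                    next = append(next, n)
                    result = append(result, n)
                }
            }
        }
        frontier = next
    }
    return result, nil
}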

Testing Difficulty (20/20) ✅

Local Testing:

# Podman/Docker
podman run -d --name redis -p 6379:6379 redis:7-alpine

# Or: redis-server (native install)
brew install redis
redis-server

Go Test Integration:

func TestRedisVertex(t *testing.T) {
    // Use testcontainers-go for isolated tests
    ctx := context.Background()
    redisC, _ := testcontainers.GenericContainer(ctx, testcontainers.GenericContainerRequest{
        ContainerRequest: testcontainers.ContainerRequest{
            Image:        "redis:7-alpine",
            ExposedPorts: []string{"6379/tcp"},
        },
        Started: true,
    })
    defer redisC.Terminate(ctx)

    // Connect and test
    endpoint, _ := redisC.Endpoint(ctx, "")
    client := redis.NewClient(&redis.Options{Addr: endpoint})
    // ... test code
}

Assessment:

  • ✅ Single-binary, no dependencies
  • ✅ Instant startup (<1 second)
  • ✅ Excellent testcontainers-go support
  • ✅ In-memory = fast tests
  • ✅ No schema migrations needed

Operational Complexity (15/20) ✅

Deployment:

  • ✅ Straightforward deployment; sharding via Redis Cluster
  • ✅ Excellent Kubernetes operators (Redis Enterprise, Bitnami)
  • ⚠️ Persistence requires RDB/AOF configuration
  • ⚠️ Memory management (eviction policies)
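
Persistence and eviction are configured rather than automatic; a hedged redis.conf sketch (values are illustrative, not recommendations):

# RDB + AOF persistence
save 900 1                     # RDB snapshot if >=1 change in 900s
appendonly yes                 # AOF: log every write
appendfsync everysec           # fsync AOF once per second (bounded loss)

# Memory management for a hot-tier cache
maxmemory 28gb                 # headroom on a 32 GB node
maxmemory-policy allkeys-lru   # evict least-recently-used keys first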

Monitoring:

  • ✅ Built-in INFO command exposes all metrics
  • ✅ Prometheus exporter available
  • ✅ Grafana dashboards

Scaling:

  • ✅ Horizontal: Redis Cluster (sharding)
  • ✅ Vertical: Add memory
  • ⚠️ Rebalancing requires cluster resharding

Assessment: Mature operational tooling, memory constraints require planning.

Overall Assessment

Strengths:

  • ✅ Best Go SDK of all backends
  • ✅ Perfect data model for hot tier graphs
  • ✅ Trivial local testing
  • ✅ Sub-millisecond latency
  • ✅ Battle-tested at scale (Twitter, GitHub, StackOverflow)

Weaknesses:

  • ⚠️ Memory-bound (expensive at 100B scale)
  • ⚠️ No native graph traversal (application-level)
  • ⚠️ Persistence trade-offs (RDB snapshots vs AOF overhead)

Use Case: ✅ Ideal for hot tier (10% of data) as validated by RFC-059


Rank #2: PostgreSQL (Score: 90/100) ✅

Overview

Type: Relational database with JSONB support
Best For: Metadata, indexes, small-to-medium graphs
Used In: RFC-058 (index storage), potential for partition metadata

Go SDK Quality (30/30) ✅

// Popular: github.com/lib/pq or github.com/jackc/pgx/v5
import "github.com/jackc/pgx/v5/pgxpool"

pool, _ := pgxpool.New(ctx, "postgres://user:pass@localhost:5432/graphdb")

// Excellent query builder, prepared statements
var vertex Vertex
err := pool.QueryRow(ctx,
    "SELECT id, properties FROM vertices WHERE id = $1",
    vertexID,
).Scan(&vertex.ID, &vertex.Properties)

Assessment:

  • ✅ Multiple excellent Go drivers (lib/pq, pgx)
  • ✅ Strong typing, connection pooling
  • ✅ Excellent documentation
  • ✅ Native support for JSON/JSONB
  • ✅ Prepared statements, batch operations
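
As one example of the batch support, a hedged sketch of bulk vertex inserts with pgx's Batch API (the `vertices` slice is a hypothetical input):

// Sketch: batched inserts, sent in a single network round trip.
batch := &pgx.Batch{}
for _, v := range vertices { // hypothetical []Vertex input
    batch.Queue(
        "INSERT INTO vertices (id, label, properties) VALUES ($1, $2, $3)",
        v.ID, v.Label, v.Properties,
    )
}
br := pool.SendBatch(ctx, batch)
defer br.Close()
_, err := br.Exec() // one Exec per queued statement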

Data Model Fit (25/30) ✅

Schema Design:

-- Vertices table
CREATE TABLE vertices (
    id BIGINT PRIMARY KEY,
    label TEXT NOT NULL,
    properties JSONB NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Edges table (adjacency list)
CREATE TABLE edges (
    src_id BIGINT NOT NULL,
    dst_id BIGINT NOT NULL,
    label TEXT NOT NULL,
    properties JSONB,
    PRIMARY KEY (src_id, dst_id, label)
);

-- Indexes for traversal
CREATE INDEX idx_edges_src ON edges(src_id);
CREATE INDEX idx_edges_dst ON edges(dst_id);
CREATE INDEX idx_vertices_props ON vertices USING GIN(properties);

Graph Operations:

-- Find friends (1-hop)
SELECT v.* FROM vertices v
JOIN edges e ON e.dst_id = v.id
WHERE e.src_id = 123 AND e.label = 'friend';

-- Property filter
SELECT * FROM vertices
WHERE properties @> '{"country": "USA"}';

-- 2-hop traversal (CTE)
WITH RECURSIVE friends AS (
    SELECT dst_id, 1 AS depth FROM edges WHERE src_id = 123
    UNION
    SELECT e.dst_id, f.depth + 1
    FROM edges e
    JOIN friends f ON e.src_id = f.dst_id
    WHERE f.depth < 2
)
SELECT v.* FROM vertices v JOIN friends f ON v.id = f.dst_id;

Assessment:

  • ✅ JSONB excellent for flexible properties
  • ✅ GIN indexes for JSONB queries
  • ✅ Recursive CTEs for traversals (up to ~3 hops practical)
  • ⚠️ Deep traversals (4+ hops) become expensive
  • ⚠️ No native graph algorithms

Testing Difficulty (20/20) ✅

Local Testing:

# Podman
podman run -d --name postgres \
  -e POSTGRES_PASSWORD=secret \
  -p 5432:5432 \
  postgres:16-alpine

Test Helpers:

func TestPostgresGraph(t *testing.T) {
    // Use testcontainers-go
    ctx := context.Background()
    pgC, _ := testcontainers.GenericContainer(ctx, testcontainers.GenericContainerRequest{
        ContainerRequest: testcontainers.ContainerRequest{
            Image:        "postgres:16-alpine",
            Env:          map[string]string{"POSTGRES_PASSWORD": "secret"},
            ExposedPorts: []string{"5432/tcp"},
            WaitingFor:   wait.ForLog("database system is ready"),
        },
        Started: true,
    })
    defer pgC.Terminate(ctx)

    // Run migrations, seed test data
    // ... test code
}

Assessment:

  • ✅ Excellent testcontainers-go support
  • ✅ Fast startup (~3 seconds)
  • ✅ Schema migrations via goose/migrate
  • ✅ Test data generation straightforward

Operational Complexity (15/20) ✅

Deployment:

  • ✅ Mature Kubernetes operators (Crunchy, Zalando)
  • ✅ Excellent backup/restore (pg_dump, WAL archiving)
  • ✅ Streaming replication

Monitoring:

  • ✅ pg_stat_* views expose all metrics
  • ✅ Excellent Prometheus exporters
  • ✅ Deep observability (query plans, slow logs)

Scaling:

  • ✅ Vertical: Add CPU/memory/storage
  • ⚠️ Horizontal: Requires sharding (Citus, manual)
  • ⚠️ Large tables (>100M rows) need partitioning
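
Where the edges table grows past that point, declarative partitioning is one option; a hedged SQL sketch (hash partitioning by src_id; the partition count is illustrative):

-- Sketch: hash-partition edges by source vertex (4 partitions)
CREATE TABLE edges_part (
    src_id BIGINT NOT NULL,
    dst_id BIGINT NOT NULL,
    label TEXT NOT NULL,
    properties JSONB,
    PRIMARY KEY (src_id, dst_id, label)
) PARTITION BY HASH (src_id);

CREATE TABLE edges_part_0 PARTITION OF edges_part FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE edges_part_1 PARTITION OF edges_part FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE edges_part_2 PARTITION OF edges_part FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE edges_part_3 PARTITION OF edges_part FOR VALUES WITH (MODULUS 4, REMAINDER 3);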

Assessment: Excellent operational maturity, horizontal scaling requires extensions.

Overall Assessment

Strengths:

  • ✅ Excellent Go SDK (pgx)
  • ✅ JSONB perfect for flexible properties
  • ✅ Recursive CTEs for limited traversals
  • ✅ Trivial local testing
  • ✅ Nearly four decades of operational knowledge

Weaknesses:

  • ⚠️ Deep traversals (4+ hops) expensive
  • ⚠️ Horizontal scaling requires extensions
  • ⚠️ Not optimized for graph algorithms

Use Case: ✅ Ideal for metadata, indexes, small graphs (<1B vertices)


Rank #3: SQLite (Score: 85/100) ✅

Overview

Type: Embedded relational database
Best For: Development, testing, single-node graphs
Used In: Local development, CI/CD tests

Go SDK Quality (28/30) ✅

// Popular: github.com/mattn/go-sqlite3 (CGo) or modernc.org/sqlite (pure Go)
import (
    "database/sql"

    _ "modernc.org/sqlite" // Pure Go, no CGo
)

db, _ := sql.Open("sqlite", "graph.db")

// Standard database/sql interface
rows, _ := db.Query("SELECT id, properties FROM vertices WHERE label = ?", "user")

Assessment:

  • ✅ Pure Go option (modernc.org/sqlite) - no CGo
  • ✅ Standard database/sql interface
  • ✅ JSON1 extension for JSONB-like operations
  • ⚠️ CGo version (mattn/go-sqlite3) more mature but complicates cross-compile

Data Model Fit (25/30) ✅

Schema (identical to PostgreSQL):

CREATE TABLE vertices (
    id INTEGER PRIMARY KEY,
    label TEXT NOT NULL,
    properties JSON -- JSON1 extension
);

CREATE TABLE edges (
    src_id INTEGER,
    dst_id INTEGER,
    label TEXT,
    properties JSON,
    PRIMARY KEY (src_id, dst_id, label)
);

JSON Operations:

-- JSON extraction (requires JSON1 extension)
SELECT * FROM vertices WHERE json_extract(properties, '$.country') = 'USA';

Assessment:

  • ✅ Same schema as PostgreSQL (easy migration)
  • ✅ JSON1 extension for property queries
  • ✅ Recursive CTEs supported
  • ⚠️ Performance degrades >10M rows
  • ⚠️ Single-writer limitation
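
WAL mode softens (but does not remove) the single-writer constraint by letting readers proceed while a write is in progress; a minimal sketch:

// Sketch: enable WAL so reads are not blocked by the single writer.
db, _ := sql.Open("sqlite", "graph.db")
db.Exec("PRAGMA journal_mode=WAL;")  // readers proceed during writes
db.Exec("PRAGMA busy_timeout=5000;") // writers wait up to 5s on contention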

Testing Difficulty (20/20) ✅

Local Testing:

func TestSQLiteGraph(t *testing.T) {
    // In-memory database
    db, _ := sql.Open("sqlite", ":memory:")

    // Or: a temporary file
    // db, _ := sql.Open("sqlite", t.TempDir()+"/test.db")

    // Run migrations, seed data
    // ... test code
}

Assessment:

  • Best testing experience - no external dependencies
  • ✅ In-memory mode for ultra-fast tests
  • ✅ Zero setup, zero teardown
  • ✅ Perfect for CI/CD

Operational Complexity (12/20) ⚠️

Deployment:

  • ✅ Embedded = zero deployment complexity
  • ❌ Single-node only (no replication)
  • ❌ Single-writer (write concurrency limited)

Monitoring:

  • ⚠️ Limited built-in metrics
  • ⚠️ Must implement application-level monitoring

Scaling:

  • ❌ No horizontal scaling
  • ⚠️ Vertical scaling limited by single file I/O

Assessment: Perfect for development, unsuitable for distributed production.

Overall Assessment

Strengths:

  • Best testing experience (in-memory, no dependencies)
  • ✅ Pure Go option available
  • ✅ Same schema as PostgreSQL
  • ✅ Perfect for CI/CD pipelines

Weaknesses:

  • ❌ Single-node only
  • ❌ Limited to ~10M rows before performance degrades
  • ❌ Single-writer concurrency

Use Case: ✅ Development, testing, CI/CD - not production at scale


Rank #4: S3/MinIO (Score: 80/100) ✅

Overview

Type: Object storage
Best For: Cold tier snapshots, bulk data loading
Used In: RFC-059 (90% cold data), RFC-057 (bulk snapshots)

Go SDK Quality (28/30) ✅

// AWS SDK: github.com/aws/aws-sdk-go-v2
// MinIO SDK: github.com/minio/minio-go/v7
import (
    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/s3"
)

cfg, _ := config.LoadDefaultConfig(ctx)
client := s3.NewFromConfig(cfg)

// Upload partition snapshot
_, err := client.PutObject(ctx, &s3.PutObjectInput{
    Bucket: aws.String("graph-snapshots"),
    Key:    aws.String("partition-123.parquet"),
    Body:   snapshotReader,
})

Assessment:

  • ✅ Official AWS SDK v2 (excellent)
  • ✅ MinIO SDK compatible with S3 API
  • ✅ Excellent documentation
  • ✅ Concurrent uploads/downloads
  • ⚠️ HTTP-based (not as ergonomic as native protocols)

Data Model Fit (20/30) ⚠️

Snapshot Format (from RFC-059):

// Parquet columnar format
type PartitionSnapshot struct {
    Vertices []Vertex // Columnar: ID, Label, Properties
    Edges    []Edge   // Columnar: SrcID, DstID, Label
    Metadata Metadata // Version, timestamp, checksum
}

// S3 key structure
// s3://bucket/snapshots/v1/cluster-1/partition-0001/2025-11-16T00:00:00Z.parquet

Operations:

// Parallel load (1000 workers)
var wg sync.WaitGroup
for partitionID := 0; partitionID < 1000; partitionID++ {
    wg.Add(1)
    go func(id int) {
        defer wg.Done()
        resp, _ := client.GetObject(ctx, &s3.GetObjectInput{
            Bucket: aws.String("graph-snapshots"),
            Key:    aws.String(fmt.Sprintf("partition-%04d.parquet", id)),
        })
        defer resp.Body.Close()
        // Decompress and load into memory
    }(partitionID)
}
wg.Wait()

Assessment:

  • ✅ Perfect for immutable snapshots
  • ✅ Parallel loading (1000 workers = 60 seconds for 10 TB, per RFC-059)
  • ✅ Versioning, lifecycle policies
  • ❌ No random access to individual vertices
  • ❌ Not suitable for transactional workloads
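
For multi-gigabyte snapshot objects, the SDK's upload manager (github.com/aws/aws-sdk-go-v2/feature/s3/manager) handles multipart uploads; a hedged sketch, with an illustrative part size and the same assumed `snapshotReader` input as above:

// Sketch: multipart upload of a large partition snapshot.
uploader := manager.NewUploader(client, func(u *manager.Uploader) {
    u.PartSize = 64 << 20 // 64 MiB parts (illustrative)
})
_, err := uploader.Upload(ctx, &s3.PutObjectInput{
    Bucket: aws.String("graph-snapshots"),
    Key:    aws.String("snapshots/v1/cluster-1/partition-0001/2025-11-16T00:00:00Z.parquet"),
    Body:   snapshotReader,
})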

Testing Difficulty (18/20) ✅

Local Testing:

# MinIO (S3-compatible)
podman run -d --name minio \
  -p 9000:9000 -p 9001:9001 \
  -e MINIO_ROOT_USER=minioadmin \
  -e MINIO_ROOT_PASSWORD=minioadmin \
  minio/minio server /data --console-address ":9001"

Go Tests:

func TestS3Snapshots(t *testing.T) {
    // Use testcontainers-go with MinIO
    ctx := context.Background()
    minioC, _ := testcontainers.GenericContainer(ctx, testcontainers.GenericContainerRequest{
        ContainerRequest: testcontainers.ContainerRequest{
            Image: "minio/minio",
            Env: map[string]string{
                "MINIO_ROOT_USER":     "minioadmin",
                "MINIO_ROOT_PASSWORD": "minioadmin",
            },
            Cmd:          []string{"server", "/data"},
            ExposedPorts: []string{"9000/tcp"},
        },
        Started: true,
    })
    defer minioC.Terminate(ctx)
    // ... test code
}

Assessment:

  • ✅ MinIO provides S3-compatible local testing
  • ✅ Testcontainers-go support
  • ⚠️ Parquet encoding/decoding adds test complexity

Operational Complexity (14/20) ⚠️

Deployment:

  • ✅ S3 = fully managed (AWS)
  • ✅ MinIO = self-hosted alternative
  • ✅ No state to manage (immutable objects)

Monitoring:

  • ✅ CloudWatch metrics (S3)
  • ✅ Prometheus exporter (MinIO)
  • ⚠️ Request costs require careful tracking (RFC-059 finding)

Scaling:

  • ✅ Infinite horizontal scaling
  • ✅ 99.999999999% durability (S3)
  • ⚠️ Request costs scale with operations

Assessment: Excellent for cold storage, request costs require monitoring.

Overall Assessment

Strengths:

  • ✅ Excellent for immutable snapshots
  • ✅ 95% cost reduction vs all-in-memory (RFC-059)
  • ✅ Infinite scaling
  • ✅ MinIO enables local testing

Weaknesses:

  • ❌ No random access to vertices
  • ❌ Not suitable for transactional workloads
  • ⚠️ Request costs can exceed storage costs

Use Case: ✅ Cold tier (90% of data) as validated by RFC-059


Rank #5: ClickHouse (Score: 75/100) ⚠️

Overview

Type: Columnar OLAP database
Best For: Time-series analytics, audit logs, query statistics
Used In: Potential for RFC-061 audit log storage

Go SDK Quality (26/30) ✅

// Official: github.com/ClickHouse/clickhouse-go/v2
import "github.com/ClickHouse/clickhouse-go/v2"

conn, _ := clickhouse.Open(&clickhouse.Options{
    Addr: []string{"localhost:9000"},
})

// Query with strong typing
rows, _ := conn.Query(ctx,
    "SELECT event_id, timestamp, vertex_id FROM audit_log WHERE timestamp > ?",
    time.Now().Add(-1*time.Hour),
)

Assessment:

  • ✅ Official Go SDK
  • ✅ Good documentation
  • ✅ Native protocol (not HTTP)
  • ⚠️ API less ergonomic than PostgreSQL

Data Model Fit (18/30) ⚠️

Schema Design:

-- Audit log (time-series)
CREATE TABLE audit_log (
    event_id UInt64,
    timestamp DateTime,
    user_id UInt64,
    action String,
    vertex_id UInt64,
    details String -- JSON string
) ENGINE = MergeTree()
ORDER BY (timestamp, user_id);

-- Query statistics (aggregations)
CREATE TABLE query_stats (
    query_id UInt64,
    timestamp DateTime,
    latency_ms UInt32,
    vertices_scanned UInt64,
    partition_id UInt32
) ENGINE = MergeTree()
ORDER BY timestamp;

Queries:

-- Fast time-range scans
SELECT COUNT(*) FROM audit_log
WHERE timestamp BETWEEN '2025-11-01' AND '2025-11-16';

-- Aggregations
SELECT
    toStartOfHour(timestamp) AS hour,
    COUNT(*) AS event_count,
    avg(latency_ms) AS avg_latency
FROM query_stats
GROUP BY hour
ORDER BY hour;

Assessment:

  • ✅ Excellent for time-series data (audit logs, metrics)
  • ✅ Fast aggregations and analytics
  • ❌ Poor fit for transactional graph operations
  • ❌ No support for random vertex updates
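
For the workloads it does fit, a hedged sketch of the write path, batch inserts of audit events with clickhouse-go v2 (the `events` slice and its fields are a hypothetical input matching the audit_log schema above):

// Sketch: columnar batch insert of audit events.
batch, _ := conn.PrepareBatch(ctx, "INSERT INTO audit_log")
for _, e := range events { // hypothetical []AuditEvent input
    // Column order matches the audit_log schema above.
    _ = batch.Append(e.EventID, e.Timestamp, e.UserID, e.Action, e.VertexID, e.Details)
}
err := batch.Send()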

Testing Difficulty (16/20) ⚠️

Local Testing:

# ClickHouse container
podman run -d --name clickhouse \
-p 9000:9000 -p 8123:8123 \
clickhouse/clickhouse-server

Assessment:

  • ✅ Docker/Podman support
  • ⚠️ Slower startup (~5-10 seconds)
  • ⚠️ Schema migrations more complex
  • ⚠️ Test data generation for columnar format

Operational Complexity (15/20) ⚠️

Deployment:

  • ✅ Official Kubernetes operator
  • ✅ Horizontal scaling (sharding, replication)
  • ⚠️ Complex configuration for production

Monitoring:

  • ✅ Built-in system tables (system.metrics, system.events)
  • ✅ Prometheus exporter
  • ⚠️ Requires expertise to tune

Scaling:

  • ✅ Excellent horizontal scaling
  • ✅ Columnar compression (10-100× better than row-based)
  • ⚠️ Rebalancing shards requires planning

Assessment: Powerful for analytics, requires operational expertise.

Overall Assessment

Strengths:

  • ✅ Excellent for audit logs (RFC-061)
  • ✅ Fast time-series queries
  • ✅ Columnar compression (99% reduction, per RFC-061)

Weaknesses:

  • ❌ Poor fit for transactional graph operations
  • ⚠️ Operational complexity
  • ⚠️ Not suitable for hot path queries

Use Case: ⚠️ Specialized - audit logs and analytics only


Rank #6: Kafka (Score: 70/100) ⚠️

Overview

Type: Distributed event streaming
Best For: Event sourcing, change data capture, real-time updates
Used In: Potential for graph mutation streams

Go SDK Quality (25/30) ✅

// Popular: github.com/segmentio/kafka-go
import "github.com/segmentio/kafka-go"

writer := kafka.NewWriter(kafka.WriterConfig{
    Brokers: []string{"localhost:9092"},
    Topic:   "graph-mutations",
})

// Write vertex update event
writer.WriteMessages(ctx, kafka.Message{
    Key:   []byte("vertex:123"),
    Value: vertexUpdateJSON,
})

Assessment:

  • ✅ Excellent Go libraries (segmentio/kafka-go, Shopify/sarama)
  • ✅ Good documentation
  • ⚠️ API complexity for distributed systems newcomers

Data Model Fit (15/30) ⚠️

Event Sourcing Model:

// Event stream (not direct vertex storage)
type GraphEvent struct {
EventType string // "VertexCreated", "EdgeAdded", "PropertyUpdated"
VertexID string
Payload json.RawMessage
Timestamp time.Time
}

// Consumers rebuild graph state
// Topic: graph-mutations
// Partition key: VertexID (ensures order per vertex)

Assessment:

  • ⚠️ Not a storage backend - event streaming only
  • ⚠️ Requires separate storage for materialized views
  • ✅ Good for change data capture (CDC)
  • ✅ Enables time-travel queries (replay events)
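
A hedged sketch of replay with kafka-go, re-reading one partition from the earliest retained offset (`applyEvent` is a hypothetical handler that rebuilds materialized state):

// Sketch: time-travel by replaying the mutation log from the start.
reader := kafka.NewReader(kafka.ReaderConfig{
    Brokers:   []string{"localhost:9092"},
    Topic:     "graph-mutations",
    Partition: 0,
})
defer reader.Close()
reader.SetOffset(kafka.FirstOffset) // oldest retained event

for {
    msg, err := reader.ReadMessage(ctx)
    if err != nil {
        break // context cancellation or read error ends the replay
    }
    applyEvent(msg.Key, msg.Value) // hypothetical state-rebuild handler
}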

Testing Difficulty (14/20) ⚠️

Local Testing:

# Kafka + Zookeeper (Zookeeper must start first)
podman run -d --name zookeeper \
  -p 2181:2181 \
  confluentinc/cp-zookeeper

# Kafka broker
podman run -d --name kafka \
  -p 9092:9092 \
  -e KAFKA_ZOOKEEPER_CONNECT=zookeeper:2181 \
  confluentinc/cp-kafka

Assessment:

  • ⚠️ Requires multiple containers (Kafka + Zookeeper or KRaft)
  • ⚠️ Slower startup (~10-15 seconds)
  • ⚠️ Topic management in tests adds complexity

Operational Complexity (16/20) ⚠️

Deployment:

  • ✅ Mature Kubernetes operators (Strimzi)
  • ✅ Excellent horizontal scaling
  • ⚠️ Requires careful partition management

Monitoring:

  • ✅ JMX metrics, Prometheus exporters
  • ✅ Excellent observability
  • ⚠️ Many metrics to track

Scaling:

  • ✅ Horizontal scaling via partitions
  • ✅ High throughput (millions of events/sec)
  • ⚠️ Rebalancing can cause lag spikes

Assessment: Powerful for event streaming, overkill for simple graph storage.

Overall Assessment

Strengths:

  • ✅ Excellent for change data capture
  • ✅ Enables event sourcing patterns
  • ✅ High throughput

Weaknesses:

  • ❌ Not a storage backend (requires materialized views)
  • ⚠️ Operational complexity
  • ⚠️ Testing complexity (multiple dependencies)

Use Case: ⚠️ Specialized - event sourcing and CDC only


Rank #7: NATS (Score: 65/100) ⚠️

Overview

Type: Messaging system
Best For: Real-time pub/sub, distributed cache invalidation
Used In: Potential for cache invalidation notifications

Go SDK Quality (27/30) ✅

// Official: github.com/nats-io/nats.go
import "github.com/nats-io/nats.go"

nc, _ := nats.Connect("nats://localhost:4222")

// Publish cache invalidation
nc.Publish("cache.invalidate.vertex.123", []byte("invalidate"))

// Subscribe to invalidations ('>' matches the multi-token suffix;
// '*' would match only a single token and miss "vertex.123")
nc.Subscribe("cache.invalidate.>", func(msg *nats.Msg) {
    // Handle invalidation
})

Assessment:

  • ✅ Excellent official Go SDK
  • ✅ Simple, idiomatic API
  • ✅ Strong typing
  • ✅ Native Go (no CGo)

Data Model Fit (12/30) ❌

Messaging Only:

// NATS is NOT a storage backend
// Use case: Notify proxies of vertex updates
type InvalidationMessage struct {
VertexID string
PartitionID int
Timestamp time.Time
}

Assessment:

  • ❌ Not a storage backend - messaging only
  • ❌ No persistence (JetStream adds persistence but still not a database)
  • ✅ Good for cache invalidation notifications

Testing Difficulty (18/20) ✅

Local Testing:

# NATS server (single binary)
podman run -d --name nats -p 4222:4222 nats:latest

Assessment:

  • ✅ Single binary, fast startup (<1 second)
  • ✅ Simple testcontainers-go integration
  • ✅ Embedded mode for tests (nats-server package)
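
The embedded mode removes even the container dependency; a hedged sketch using github.com/nats-io/nats-server/v2/server (options are illustrative):

// Sketch: run an embedded NATS server inside a Go test.
opts := &server.Options{Port: -1} // -1 picks a random free port
ns, err := server.NewServer(opts)
if err != nil {
    t.Fatal(err)
}
go ns.Start()
defer ns.Shutdown()
if !ns.ReadyForConnections(5 * time.Second) {
    t.Fatal("embedded NATS server did not start")
}
nc, _ := nats.Connect(ns.ClientURL())
defer nc.Close()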

Operational Complexity (8/20) ❌

Deployment:

  • ✅ Simple deployment
  • ✅ Kubernetes operators available
  • ❌ Not a storage system

Monitoring:

  • ✅ Prometheus exporter
  • ✅ Built-in metrics endpoint

Scaling:

  • ✅ Horizontal scaling (clustering)
  • ❌ Not applicable for storage

Assessment: Excellent for messaging, not a storage backend.

Overall Assessment

Strengths:

  • ✅ Excellent Go SDK
  • ✅ Fast, lightweight
  • ✅ Perfect for cache invalidation

Weaknesses:

  • ❌ Not a storage backend
  • ❌ Limited use case for graph storage

Use Case: ⚠️ Specialized - cache invalidation only


Rank #8: Neptune (Score: 50/100) ❌

Overview

Type: Native graph database (AWS managed)
Best For: AWS-native graph applications
Used In: Potential alternative to custom graph implementation

Go SDK Quality (10/30) ❌

Problem: No official Go SDK for Gremlin

// Must use HTTP/WebSocket client
import "github.com/go-gremlin/gremlin"

// Unofficial, limited support
client, _ := gremlin.NewClient("ws://neptune-endpoint:8182/gremlin")

query := "g.V('123').outE('friend').inV()"
results, _ := client.Execute(query)

Assessment:

  • ❌ No official Go SDK
  • ❌ HTTP/WebSocket-based (not native protocol)
  • ⚠️ Unofficial libraries with limited support
  • ⚠️ Gremlin query strings (no type safety)

Data Model Fit (30/30) ✅

Native Graph Model:

// Gremlin queries (native graph traversals)
// Add vertex
g.addV('user').property('id', '123').property('name', 'Alice')

// Add edge
g.V('123').addE('friend').to(V('456'))

// Traverse
g.V('123').outE('friend').inV().values('name')

// Complex traversal (2-hop)
g.V('123').out('friend').out('friend').dedup()

Assessment:

  • Best data model for graphs
  • ✅ Native graph traversals
  • ✅ Graph algorithms (PageRank, shortest path)
  • ✅ No impedance mismatch

Testing Difficulty (5/20) ❌

Problem: No local Neptune

# No Docker/Podman image
# Must use:
# 1. AWS Neptune (expensive for CI/CD)
# 2. TinkerGraph (in-memory, different semantics)
# 3. JanusGraph (different backend, setup complex)

Assessment:

  • ❌ No local testing option
  • ❌ Expensive to use real Neptune for CI/CD
  • ❌ Alternatives (TinkerGraph) have different behavior
  • ❌ Slow test feedback loop

Operational Complexity (5/20) ❌

Deployment:

  • ❌ AWS-only (vendor lock-in)
  • ❌ Cannot self-host
  • ⚠️ Limited region availability
  • ⚠️ Expensive ($0.58/hour for smallest instance)

Monitoring:

  • ✅ CloudWatch metrics
  • ⚠️ Limited visibility into query execution

Scaling:

  • ✅ Read replicas
  • ⚠️ Vertical scaling only (instance size)
  • ❌ No horizontal sharding

Assessment: AWS-only is major limitation for local development and multi-cloud.

Overall Assessment

Strengths:

  • Best native graph model
  • ✅ Built-in graph algorithms
  • ✅ Fully managed (AWS)

Weaknesses:

  • ❌ No official Go SDK
  • ❌ No local testing
  • ❌ AWS vendor lock-in
  • ❌ Expensive
  • ❌ Cannot self-host

Use Case: ❌ Not recommended due to lack of Go SDK and local testing

Alternative: Building a custom graph layer on top of Redis + PostgreSQL + S3 (as validated by RFC-057 and RFC-059) provides better control and testability.


Recommendations

Primary Recommendation: Hybrid Approach ✅

Architecture (from RFC-059):

┌─────────────────────────────────────┐
│ Hot Tier (10%): Redis               │
│ - 10B most-accessed vertices        │
│ - Sub-millisecond latency           │
│ - 21 TB RAM across 1000 nodes       │
│ - Cost: $587k/month                 │
└─────────────────────────────────────┘
                  │ Temperature-based eviction
                  ▼
┌─────────────────────────────────────┐
│ Cold Tier (90%): S3/MinIO           │
│ - 90B cold vertices                 │
│ - 50-200ms latency (parallel load)  │
│ - Parquet snapshots                 │
│ - Cost: $4.3k/month                 │
└─────────────────────────────────────┘
                  │ Metadata queries
                  ▼
┌─────────────────────────────────────┐
│ Metadata: PostgreSQL                │
│ - Partition metadata                │
│ - Index structures (RFC-058)        │
│ - Configuration                     │
│ - Cost: $500/month                  │
└─────────────────────────────────────┘

Rationale:

  • ✅ Redis for hot tier (10%) - validated by RFC-059 (95% cost reduction)
  • ✅ S3 for cold tier (90%) - 60-second recovery time
  • ✅ PostgreSQL for metadata - JSONB indexes
  • ✅ All three have excellent Go SDKs and local testing
  • ✅ Total cost: ~$592k/month vs $105M/month (all in-memory)

Implementation Phases

Phase 1: Core Storage (Weeks 13-14)

Components:

  1. Redis hot tier driver
  2. S3/MinIO cold tier driver
  3. PostgreSQL metadata driver

Go Packages:

pkg/storage/
├── interface.go        // Storage interface
├── redis/
│   ├── driver.go       // Redis hot tier
│   └── driver_test.go
├── s3/
│   ├── driver.go       // S3 cold tier
│   └── driver_test.go
└── postgres/
    ├── driver.go       // PostgreSQL metadata
    └── driver_test.go
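
The Storage interface itself is not yet specified; a hedged sketch of what interface.go could start from (names and method set are assumptions, not a final design):

// pkg/storage/interface.go (sketch; method set is an assumption)
type Store interface {
    GetVertex(ctx context.Context, id string) (*Vertex, error)
    PutVertex(ctx context.Context, v *Vertex) error
    Neighbors(ctx context.Context, id string, label string) ([]string, error)
    Close() error
}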


Phase 2: Testing Infrastructure (Week 14)

Local Stack:

# docker-compose.yml or Podman equivalent
services:
  redis:
    image: redis:7-alpine
    ports: ["6379:6379"]

  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    ports: ["9000:9000", "9001:9001"]

  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_PASSWORD: secret
    ports: ["5432:5432"]

Test Helpers:

// pkg/testing/storage.go
func NewTestStorageBackends(t *testing.T) (*redis.Client, *s3.Client, *pgxpool.Pool) {
    // Use testcontainers-go to spin up all three
    // Return clients ready for testing
}


Phase 3: Specialized Backends (Week 15-16)

Optional additions based on workload:

  1. ClickHouse for audit logs (RFC-061)
  2. Kafka for event sourcing
  3. NATS for cache invalidation

Cost Analysis

Monthly Operational Costs (100B Vertices)

| Backend    | Use Case        | Scale        | Cost/Month | % of Total |
|------------|-----------------|--------------|------------|------------|
| Redis      | Hot tier (10%)  | 1000 × 32 GB | $587,347   | 99.0%      |
| S3         | Cold tier (90%) | 189 TB       | $4,347     | 0.7%       |
| PostgreSQL | Metadata        | 3 replicas   | $500       | 0.1%       |
| ClickHouse | Audit logs      | 10 nodes     | $1,000     | 0.2%       |
| Total      |                 |              | $593,194   | 100%       |

Cost Breakdown:

  • Hot tier dominates costs (99%)
  • Cold tier is negligible (0.7%)
  • Total is 0.56% of all-in-memory cost ($105M/month)

Savings: 99.44% reduction ($105M/month → $593k/month, roughly $104.4M/month saved)


Next Steps

Week 14: Performance Benchmarking

Focus: Validate storage backend performance under load

Tasks:

  1. Benchmark Redis hot tier (latency, throughput)
  2. Benchmark S3 cold tier (parallel load performance)
  3. Benchmark PostgreSQL metadata queries
  4. Measure temperature-based eviction performance
  5. Validate 60-second recovery time (RFC-059 claim)

Week 15: Disaster Recovery and Data Lifecycle

Focus: Backup, restore, replication strategies

Tasks:

  1. Redis persistence (RDB vs AOF trade-offs)
  2. S3 versioning and lifecycle policies
  3. PostgreSQL streaming replication
  4. Cross-region disaster recovery
  5. RPO/RTO validation

Week 16: Comprehensive Cost Analysis

Focus: Detailed cost modeling and optimization

Tasks:

  1. Detailed AWS/GCP/Azure pricing comparison
  2. Request cost analysis (S3 GET/PUT costs)
  3. Network egress costs
  4. Reserved instance vs on-demand savings
  5. Cost optimization recommendations

Appendices

Appendix A: Backend Scoring Rubric

Go SDK Quality (30 points):

  • Official SDK: +15 points
  • Good documentation: +5 points
  • Active community: +5 points
  • Idiomatic Go: +5 points

Data Model Fit (30 points):

  • Native graph support: +30 points
  • Relational with JSON: +25 points
  • Key-value: +20 points
  • Event streaming: +15 points
  • Object storage: +10 points

Testing Difficulty (20 points):

  • In-memory/embedded: +20 points
  • Single container: +18 points
  • Multiple containers: +14 points
  • External service only: +5 points

Operational Complexity (20 points):

  • Mature tooling: +10 points
  • Easy deployment: +5 points
  • Good monitoring: +3 points
  • Horizontal scaling: +2 points
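
Worked example: Redis scores 30 (Go SDK) + 30 (data model) + 20 (testing) + 15 (operations) = 95/100, matching the ranking table.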

Appendix B: Data Model Comparison

| Backend    | Vertices      | Edges       | Properties | Traversals     |
|------------|---------------|-------------|------------|----------------|
| Neptune    | Native        | Native      | Native     | Native ✅      |
| Redis      | Hash/JSON     | Sorted Sets | Hash/JSON  | Application    |
| PostgreSQL | Table + JSONB | Table       | JSONB      | CTE (3 hops)   |
| SQLite     | Table + JSON  | Table       | JSON       | CTE (3 hops)   |
| S3/MinIO   | Parquet       | Parquet     | Columns    | Bulk only      |
| ClickHouse | Table         | Table       | Columns    | Analytics only |
| Kafka      | N/A           | N/A         | N/A        | N/A            |
| NATS       | N/A           | N/A         | N/A        | N/A            |

Appendix C: Testing Comparison

| Backend    | Startup Time | Dependencies | Testcontainers | CI/CD Friendly |
|------------|--------------|--------------|----------------|----------------|
| SQLite     | <1ms         | None         | N/A            | ✅ Excellent   |
| Redis      | ~1s          | None         | ✅ Yes         | ✅ Excellent   |
| PostgreSQL | ~3s          | None         | ✅ Yes         | ✅ Good        |
| NATS       | ~1s          | None         | ✅ Yes         | ✅ Excellent   |
| MinIO      | ~2s          | None         | ✅ Yes         | ✅ Good        |
| ClickHouse | ~10s         | None         | ✅ Yes         | ⚠️ Moderate    |
| Kafka      | ~15s         | Zookeeper    | ✅ Yes         | ⚠️ Moderate    |
| Neptune    | N/A          | AWS Account  | ❌ No          | ❌ Poor        |