MEMO-073: Week 13 - Storage Backend Evaluation for Massive-Scale Graphs
Date: 2025-11-16
Updated: 2025-11-16
Author: Platform Team
Related: MEMO-052, RFC-057, RFC-058, RFC-059
Executive Summary
Goal: Evaluate storage backend options for 100B vertex graph system
Scope: 8 storage backends ranked by implementability for graph workloads
Findings:
- Best for graphs: Neptune (native graph), TigerGraph (native graph)
- Best for scale: S3/MinIO (cold storage), ClickHouse (time-series)
- Most practical: PostgreSQL (relational, JSONB), Redis (in-memory)
- Implementability winner: Redis (rank #1, score 95/100)
- Cost winner: S3/MinIO (cold storage tier)
Recommendation: Hybrid approach - Redis (hot tier) + S3 (cold tier) + PostgreSQL (metadata) as validated by RFC-059.
Methodology
Evaluation Criteria
Implementability Score (0-100):
- Go SDK Quality (30 points): Official SDK, community support, documentation
- Data Model Fit (30 points): How naturally backend supports graph operations
- Testing Difficulty (20 points): Local testing, Docker support, test data generation
- Operational Complexity (20 points): Deployment, monitoring, scaling
Data Models Supported
For graph workloads, backends must support:
- Vertices: Key-value or document storage
- Edges: Adjacency lists or edge tables
- Properties: Nested attributes on vertices/edges
- Indexes: Property lookups, traversal optimization
- Partitioning: Distribute across nodes
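For reference, a minimal Go sketch of the vertex/edge shapes these requirements assume (field names and ID types are illustrative, not a finalized schema):
// Illustrative graph types used throughout this memo's examples.
type Vertex struct {
    ID         string                 `json:"id"`
    Label      string                 `json:"label"`
    Properties map[string]interface{} `json:"properties"` // nested attributes
}
type Edge struct {
    SrcID      string                 `json:"src_id"`
    DstID      string                 `json:"dst_id"`
    Label      string                 `json:"label"`
    Properties map[string]interface{} `json:"properties,omitempty"`
}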
Findings
Backend Ranking Summary
| Rank | Backend | Score | Go SDK | Data Model | Testing | Best For |
|---|---|---|---|---|---|---|
| 1 | Redis | 95/100 | ✅ Excellent | ✅ Graph-friendly | ✅ Easy | Hot tier caching |
| 2 | PostgreSQL | 90/100 | ✅ Excellent | ✅ Good (JSONB) | ✅ Easy | Metadata, indexes |
| 3 | SQLite | 85/100 | ✅ Good | ✅ Good (JSON) | ✅ Trivial | Dev/testing |
| 4 | S3/MinIO | 80/100 | ✅ Good | ⚠️ Snapshot only | ✅ Easy | Cold storage |
| 5 | ClickHouse | 75/100 | ✅ Good | ⚠️ Time-series | ⚠️ Moderate | Analytics |
| 6 | Kafka | 70/100 | ✅ Good | ⚠️ Event stream | ⚠️ Moderate | Event sourcing |
| 7 | NATS | 65/100 | ✅ Good | ⚠️ Messaging | ⚠️ Moderate | Pub/sub |
| 8 | Neptune | 50/100 | ❌ None (HTTP) | ✅ Native graph | ❌ Hard | AWS-only graphs |
Key Insight: Native graph databases like Neptune score lowest on implementability despite the best data model fit, due to the lack of an official Go SDK and the difficulty of local testing.
Detailed Backend Evaluation
Rank #1: Redis (Score: 95/100) ✅
Overview
Type: In-memory key-value store with data structures
Best For: Hot tier vertex/edge caching, real-time access patterns
Used In: RFC-057 (hot tier), RFC-059 (10% hot data)
Go SDK Quality (30/30) ✅
// Official: github.com/redis/go-redis/v9
import "github.com/redis/go-redis/v9"
client := redis.NewClient(&redis.Options{
Addr: "localhost:6379",
DB: 0,
})
// Excellent API, strong typing, context support
ctx := context.Background()
err := client.Set(ctx, "vertex:123", vertexJSON, 0).Err()
Assessment:
- ✅ Official Go SDK maintained by Redis
- ✅ Excellent documentation with examples
- ✅ Strong community (19k+ GitHub stars)
- ✅ Context-aware, idiomatic Go
- ✅ Pipelining, transactions, pub/sub support
Data Model Fit (30/30) ✅
Vertex Storage:
// Option 1: Hash (structured)
client.HSet(ctx, "vertex:user:123", map[string]interface{}{
"id": "123",
"name": "Alice",
"age": 30,
"country": "USA",
})
// Option 2: JSON (Redis Stack)
client.JSONSet(ctx, "vertex:user:123", "$", vertexStruct)
Edge Storage (Adjacency Lists):
// Sorted set for edges (score = timestamp or weight)
client.ZAdd(ctx, "edges:user:123:friends", redis.Z{
Score: float64(time.Now().Unix()),
Member: "user:456",
})
// Retrieve friends
friends := client.ZRange(ctx, "edges:user:123:friends", 0, -1)
Indexes:
// Secondary indexes via sets
client.SAdd(ctx, "idx:country:USA", "user:123", "user:456")
// Retrieve all users in USA
usersInUSA := client.SMembers(ctx, "idx:country:USA")
Assessment:
- ✅ Native support for adjacency lists (sorted sets)
- ✅ Efficient property indexes (sets)
- ✅ JSON support via Redis Stack module
- ✅ Atomic operations for consistency
- ⚠️ No native graph traversal (implement in application)
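Because traversal lives in the application, a minimal sketch of a 1-hop neighbor lookup over the layout above (go-redis v9; error handling kept terse, key naming as in the examples):
// OneHop returns the property hashes of a vertex's neighbors, reading the
// adjacency list (sorted set) and then each neighbor's hash.
func OneHop(ctx context.Context, client *redis.Client, vertexKey string) ([]map[string]string, error) {
    // e.g. vertexKey = "user:123" -> adjacency key "edges:user:123:friends"
    neighbors, err := client.ZRange(ctx, "edges:"+vertexKey+":friends", 0, -1).Result()
    if err != nil {
        return nil, err
    }
    out := make([]map[string]string, 0, len(neighbors))
    for _, n := range neighbors {
        props, err := client.HGetAll(ctx, "vertex:"+n).Result() // e.g. "vertex:user:456"
        if err != nil {
            return nil, err
        }
        out = append(out, props)
    }
    return out, nil
}
In practice the per-neighbor HGetAll calls would be pipelined to keep round trips down.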
Testing Difficulty (20/20) ✅
Local Testing:
# Podman/Docker
podman run -d --name redis -p 6379:6379 redis:7-alpine
# Or: redis-server (native install)
brew install redis
redis-server
Go Test Integration:
func TestRedisVertex(t *testing.T) {
// Use testcontainers-go for isolated tests
ctx := context.Background()
redisC, _ := testcontainers.GenericContainer(ctx, testcontainers.GenericContainerRequest{
ContainerRequest: testcontainers.ContainerRequest{
Image: "redis:7-alpine",
ExposedPorts: []string{"6379/tcp"},
},
Started: true,
})
defer redisC.Terminate(ctx)
// Connect and test
endpoint, _ := redisC.Endpoint(ctx, "")
client := redis.NewClient(&redis.Options{Addr: endpoint})
// ... test code
}
Assessment:
- ✅ Single-binary, no dependencies
- ✅ Instant startup (<1 second)
- ✅ Excellent testcontainers-go support
- ✅ In-memory = fast tests
- ✅ No schema migrations needed
Operational Complexity (15/20) ✅
Deployment:
- ✅ Straightforward deployment, single node or Redis Cluster
- ✅ Excellent Kubernetes operators (Redis Enterprise, Bitnami)
- ⚠️ Persistence requires RDB/AOF configuration
- ⚠️ Memory management (eviction policies)
Monitoring:
- ✅ Built-in INFO command exposes all metrics
- ✅ Prometheus exporter available
- ✅ Grafana dashboards
Scaling:
- ✅ Horizontal: Redis Cluster (sharding)
- ✅ Vertical: Add memory
- ⚠️ Rebalancing requires cluster resharding
Assessment: Mature operational tooling, memory constraints require planning.
Overall Assessment
Strengths:
- ✅ Best Go SDK of all backends
- ✅ Perfect data model for hot tier graphs
- ✅ Trivial local testing
- ✅ Sub-millisecond latency
- ✅ Battle-tested at scale (Twitter, GitHub, StackOverflow)
Weaknesses:
- ⚠️ Memory-bound (expensive at 100B scale)
- ⚠️ No native graph traversal (application-level)
- ⚠️ Persistence trade-offs (RDB snapshots vs AOF overhead)
Use Case: ✅ Ideal for hot tier (10% of data) as validated by RFC-059
Rank #2: PostgreSQL (Score: 90/100) ✅
Overview
Type: Relational database with JSONB support
Best For: Metadata, indexes, small-to-medium graphs
Used In: RFC-058 (index storage), potential for partition metadata
Go SDK Quality (30/30) ✅
// Popular: github.com/lib/pq or github.com/jackc/pgx/v5
import "github.com/jackc/pgx/v5/pgxpool"
pool, _ := pgxpool.New(ctx, "postgres://user:pass@localhost:5432/graphdb")
// Excellent query builder, prepared statements
var vertex Vertex
err := pool.QueryRow(ctx,
"SELECT id, properties FROM vertices WHERE id = $1",
vertexID,
).Scan(&vertex.ID, &vertex.Properties)
Assessment:
- ✅ Multiple excellent Go drivers (lib/pq, pgx)
- ✅ Strong typing, connection pooling
- ✅ Excellent documentation
- ✅ Native support for JSON/JSONB
- ✅ Prepared statements, batch operations
Data Model Fit (25/30) ✅
Schema Design:
-- Vertices table
CREATE TABLE vertices (
id BIGINT PRIMARY KEY,
label TEXT NOT NULL,
properties JSONB NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- Edges table (adjacency list)
CREATE TABLE edges (
src_id BIGINT NOT NULL,
dst_id BIGINT NOT NULL,
label TEXT NOT NULL,
properties JSONB,
PRIMARY KEY (src_id, dst_id, label)
);
-- Indexes for traversal
CREATE INDEX idx_edges_src ON edges(src_id);
CREATE INDEX idx_edges_dst ON edges(dst_id);
CREATE INDEX idx_vertices_props ON vertices USING GIN(properties);
Graph Operations:
-- Find friends (1-hop)
SELECT v.* FROM vertices v
JOIN edges e ON e.dst_id = v.id
WHERE e.src_id = 123 AND e.label = 'friend';
-- Property filter
SELECT * FROM vertices
WHERE properties @> '{"country": "USA"}';
-- 2-hop traversal (CTE)
WITH RECURSIVE friends AS (
SELECT dst_id, 1 as depth FROM edges WHERE src_id = 123
UNION
SELECT e.dst_id, f.depth + 1
FROM edges e
JOIN friends f ON e.src_id = f.dst_id
WHERE f.depth < 2
)
SELECT v.* FROM vertices v JOIN friends f ON v.id = f.dst_id;
Assessment:
- ✅ JSONB excellent for flexible properties
- ✅ GIN indexes for JSONB queries
- ✅ Recursive CTEs for traversals (up to ~3 hops practical)
- ⚠️ Deep traversals (4+ hops) become expensive
- ⚠️ No native graph algorithms
Testing Difficulty (20/20) ✅
Local Testing:
# Podman
podman run -d --name postgres \
-e POSTGRES_PASSWORD=secret \
-p 5432:5432 \
postgres:16-alpine
Test Helpers:
func TestPostgresGraph(t *testing.T) {
// Use testcontainers-go
ctx := context.Background()
pgC, _ := testcontainers.GenericContainer(ctx, testcontainers.GenericContainerRequest{
ContainerRequest: testcontainers.ContainerRequest{
Image: "postgres:16-alpine",
Env: map[string]string{"POSTGRES_PASSWORD": "secret"},
ExposedPorts: []string{"5432/tcp"},
WaitingFor: wait.ForLog("database system is ready"),
},
Started: true,
})
defer pgC.Terminate(ctx)
// Run migrations, seed test data
// ... test code
}
Assessment:
- ✅ Excellent testcontainers-go support
- ✅ Fast startup (~3 seconds)
- ✅ Schema migrations via goose/migrate
- ✅ Test data generation straightforward
Operational Complexity (15/20) ✅
Deployment:
- ✅ Mature Kubernetes operators (Crunchy, Zalando)
- ✅ Excellent backup/restore (pg_dump, WAL archiving)
- ✅ Streaming replication
Monitoring:
- ✅ pg_stat_* views expose all metrics
- ✅ Excellent Prometheus exporters
- ✅ Deep observability (query plans, slow logs)
Scaling:
- ✅ Vertical: Add CPU/memory/storage
- ⚠️ Horizontal: Requires sharding (Citus, manual)
- ⚠️ Large tables (>100M rows) need partitioning
Assessment: Excellent operational maturity, horizontal scaling requires extensions.
Overall Assessment
Strengths:
- ✅ Excellent Go SDK (pgx)
- ✅ JSONB perfect for flexible properties
- ✅ Recursive CTEs for limited traversals
- ✅ Trivial local testing
- ✅ Decades of accumulated operational knowledge
Weaknesses:
- ⚠️ Deep traversals (4+ hops) expensive
- ⚠️ Horizontal scaling requires extensions
- ⚠️ Not optimized for graph algorithms
Use Case: ✅ Ideal for metadata, indexes, small graphs (<1B vertices)
Rank #3: SQLite (Score: 85/100) ✅
Overview
Type: Embedded relational database
Best For: Development, testing, single-node graphs
Used In: Local development, CI/CD tests
Go SDK Quality (28/30) ✅
// Popular: github.com/mattn/go-sqlite3 (CGo) or modernc.org/sqlite (pure Go)
import (
"database/sql"
_ "modernc.org/sqlite" // Pure Go, no CGo
)
db, _ := sql.Open("sqlite", "graph.db")
// Standard database/sql interface
rows, _ := db.Query("SELECT id, properties FROM vertices WHERE label = ?", "user")
Assessment:
- ✅ Pure Go option (modernc.org/sqlite) - no CGo
- ✅ Standard database/sql interface
- ✅ JSON1 extension for JSONB-like operations
- ⚠️ CGo version (mattn/go-sqlite3) more mature but complicates cross-compile
Data Model Fit (25/30) ✅
Schema (identical to PostgreSQL):
CREATE TABLE vertices (
id INTEGER PRIMARY KEY,
label TEXT NOT NULL,
properties JSON -- JSON1 extension
);
CREATE TABLE edges (
src_id INTEGER,
dst_id INTEGER,
label TEXT,
properties JSON,
PRIMARY KEY (src_id, dst_id, label)
);
JSON Operations:
-- JSON extraction (requires JSON1 extension)
SELECT * FROM vertices WHERE json_extract(properties, '$.country') = 'USA';
Assessment:
- ✅ Same schema as PostgreSQL (easy migration)
- ✅ JSON1 extension for property queries
- ✅ Recursive CTEs supported
- ⚠️ Performance degrades >10M rows
- ⚠️ Single-writer limitation
Testing Difficulty (20/20) ✅
Local Testing:
func TestSQLiteGraph(t *testing.T) {
    // In-memory database (fastest option)
    db, _ := sql.Open("sqlite", ":memory:")
    defer db.Close()
    // Or a temporary file: db, _ = sql.Open("sqlite", t.TempDir()+"/test.db")
    // Run migrations, seed data
    // ... test code
}
Assessment:
- ✅ Best testing experience - no external dependencies
- ✅ In-memory mode for ultra-fast tests
- ✅ Zero setup, zero teardown
- ✅ Perfect for CI/CD
Operational Complexity (12/20) ⚠️
Deployment:
- ✅ Embedded = zero deployment complexity
- ❌ Single-node only (no replication)
- ❌ Single-writer (write concurrency limited)
Monitoring:
- ⚠️ Limited built-in metrics
- ⚠️ Must implement application-level monitoring
Scaling:
- ❌ No horizontal scaling
- ⚠️ Vertical scaling limited by single file I/O
Assessment: Perfect for development, unsuitable for distributed production.
Overall Assessment
Strengths:
- ✅ Best testing experience (in-memory, no dependencies)
- ✅ Pure Go option available
- ✅ Same schema as PostgreSQL
- ✅ Perfect for CI/CD pipelines
Weaknesses:
- ❌ Single-node only
- ❌ Limited to ~10M rows before performance degrades
- ❌ Single-writer concurrency
Use Case: ✅ Development, testing, CI/CD - not production at scale
Rank #4: S3/MinIO (Score: 80/100) ✅
Overview
Type: Object storage
Best For: Cold tier snapshots, bulk data loading
Used In: RFC-059 (90% cold data), RFC-057 (bulk snapshots)
Go SDK Quality (28/30) ✅
// AWS SDK: github.com/aws/aws-sdk-go-v2
// MinIO SDK: github.com/minio/minio-go/v7
import (
    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/s3"
)
cfg, _ := config.LoadDefaultConfig(ctx)
client := s3.NewFromConfig(cfg)
// Upload partition snapshot
_, err := client.PutObject(ctx, &s3.PutObjectInput{
    Bucket: aws.String("graph-snapshots"),
    Key:    aws.String("partition-123.parquet"),
    Body:   snapshotReader,
})
Assessment:
- ✅ Official AWS SDK v2 (excellent)
- ✅ MinIO SDK compatible with S3 API
- ✅ Excellent documentation
- ✅ Concurrent uploads/downloads
- ⚠️ HTTP-based (not as ergonomic as native protocols)
Data Model Fit (20/30) ⚠️
Snapshot Format (from RFC-059):
// Parquet columnar format
type PartitionSnapshot struct {
Vertices []Vertex // Columnar: ID, Label, Properties
Edges []Edge // Columnar: SrcID, DstID, Label
Metadata Metadata // Version, timestamp, checksum
}
// S3 key structure
// s3://bucket/snapshots/v1/cluster-1/partition-0001/2025-11-16T00:00:00Z.parquet
Operations:
// Parallel load (1000 workers); a WaitGroup keeps the loader alive
// until every partition download has finished.
var wg sync.WaitGroup
for partitionID := 0; partitionID < 1000; partitionID++ {
    wg.Add(1)
    go func(id int) {
        defer wg.Done()
        resp, _ := client.GetObject(ctx, &s3.GetObjectInput{
            Bucket: aws.String("graph-snapshots"),
            Key:    aws.String(fmt.Sprintf("partition-%04d.parquet", id)),
        })
        // Decompress and load into memory, then close resp.Body
    }(partitionID)
}
wg.Wait()
Assessment:
- ✅ Perfect for immutable snapshots
- ✅ Parallel loading (1000 workers = 60 seconds for 10 TB, per RFC-059)
- ✅ Versioning, lifecycle policies
- ❌ No random access to individual vertices
- ❌ Not suitable for transactional workloads
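For intuition on the parallel-load figure: 10 TB in 60 seconds is roughly 167 GB/s aggregate, or about 170 MB/s per worker at 1000 workers, which is why the cold tier leans on wide fan-out rather than per-object throughput.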
Testing Difficulty (18/20) ✅
Local Testing:
# MinIO (S3-compatible)
podman run -d --name minio \
-p 9000:9000 -p 9001:9001 \
-e MINIO_ROOT_USER=minioadmin \
-e MINIO_ROOT_PASSWORD=minioadmin \
minio/minio server /data --console-address ":9001"
Go Tests:
func TestS3Snapshots(t *testing.T) {
    // Use testcontainers-go with MinIO
    ctx := context.Background()
    minioC, _ := testcontainers.GenericContainer(ctx, testcontainers.GenericContainerRequest{
        ContainerRequest: testcontainers.ContainerRequest{
            Image:        "minio/minio",
            Env:          map[string]string{...},
            ExposedPorts: []string{"9000/tcp"},
        },
        Started: true,
    })
    defer minioC.Terminate(ctx)
    // ... test code
}
Assessment:
- ✅ MinIO provides S3-compatible local testing
- ✅ Testcontainers-go support
- ⚠️ Parquet encoding/decoding adds test complexity
Operational Complexity (14/20) ⚠️
Deployment:
- ✅ S3 = fully managed (AWS)
- ✅ MinIO = self-hosted alternative
- ✅ No state to manage (immutable objects)
Monitoring:
- ✅ CloudWatch metrics (S3)
- ✅ Prometheus exporter (MinIO)
- ⚠️ Request costs require careful tracking (RFC-059 finding)
Scaling:
- ✅ Infinite horizontal scaling
- ✅ 99.999999999% durability (S3)
- ⚠️ Request costs scale with operations
Assessment: Excellent for cold storage, request costs require monitoring.
Overall Assessment
Strengths:
- ✅ Excellent for immutable snapshots
- ✅ 95% cost reduction vs all-in-memory (RFC-059)
- ✅ Infinite scaling
- ✅ MinIO enables local testing
Weaknesses:
- ❌ No random access to vertices
- ❌ Not suitable for transactional workloads
- ⚠️ Request costs can exceed storage costs
Use Case: ✅ Cold tier (90% of data) as validated by RFC-059
Rank #5: ClickHouse (Score: 75/100) ⚠️
Overview
Type: Columnar OLAP database
Best For: Time-series analytics, audit logs, query statistics
Used In: Potential for RFC-061 audit log storage
Go SDK Quality (26/30) ✅
// Official: github.com/ClickHouse/clickhouse-go/v2
import "github.com/ClickHouse/clickhouse-go/v2"
conn, _ := clickhouse.Open(&clickhouse.Options{
Addr: []string{"localhost:9000"},
})
// Query with strong typing
rows, _ := conn.Query(ctx, "SELECT event_id, timestamp, vertex_id FROM audit_log WHERE timestamp > ?", time.Now().Add(-1*time.Hour))
Assessment:
- ✅ Official Go SDK
- ✅ Good documentation
- ✅ Native protocol (not HTTP)
- ⚠️ API less ergonomic than PostgreSQL
Data Model Fit (18/30) ⚠️
Schema Design:
-- Audit log (time-series)
CREATE TABLE audit_log (
event_id UInt64,
timestamp DateTime,
user_id UInt64,
action String,
vertex_id UInt64,
details String -- JSON string
) ENGINE = MergeTree()
ORDER BY (timestamp, user_id);
-- Query statistics (aggregations)
CREATE TABLE query_stats (
query_id UInt64,
timestamp DateTime,
latency_ms UInt32,
vertices_scanned UInt64,
partition_id UInt32
) ENGINE = MergeTree()
ORDER BY timestamp;
Queries:
-- Fast time-range scans
SELECT COUNT(*) FROM audit_log
WHERE timestamp BETWEEN '2025-11-01' AND '2025-11-16';
-- Aggregations
SELECT
toStartOfHour(timestamp) as hour,
COUNT(*) as event_count,
avg(latency_ms) as avg_latency
FROM query_stats
GROUP BY hour
ORDER BY hour;
Assessment:
- ✅ Excellent for time-series data (audit logs, metrics)
- ✅ Fast aggregations and analytics
- ❌ Poor fit for transactional graph operations
- ❌ No support for random vertex updates
Testing Difficulty (16/20) ⚠️
Local Testing:
# ClickHouse container
podman run -d --name clickhouse \
-p 9000:9000 -p 8123:8123 \
clickhouse/clickhouse-server
Assessment:
- ✅ Docker/Podman support
- ⚠️ Slower startup (~5-10 seconds)
- ⚠️ Schema migrations more complex
- ⚠️ Test data generation for columnar format
Operational Complexity (15/20) ⚠️
Deployment:
- ✅ Official Kubernetes operator
- ✅ Horizontal scaling (sharding, replication)
- ⚠️ Complex configuration for production
Monitoring:
- ✅ Built-in system tables (system.metrics, system.events)
- ✅ Prometheus exporter
- ⚠️ Requires expertise to tune
Scaling:
- ✅ Excellent horizontal scaling
- ✅ Columnar compression (10-100× better than row-based)
- ⚠️ Rebalancing shards requires planning
Assessment: Powerful for analytics, requires operational expertise.
Overall Assessment
Strengths:
- ✅ Excellent for audit logs (RFC-061)
- ✅ Fast time-series queries
- ✅ Columnar compression (99% reduction, per RFC-061)
Weaknesses:
- ❌ Poor fit for transactional graph operations
- ⚠️ Operational complexity
- ⚠️ Not suitable for hot path queries
Use Case: ⚠️ Specialized - audit logs and analytics only
Rank #6: Kafka (Score: 70/100) ⚠️
Overview
Type: Distributed event streaming
Best For: Event sourcing, change data capture, real-time updates
Used In: Potential for graph mutation streams
Go SDK Quality (25/30) ✅
// Popular: github.com/segmentio/kafka-go
import "github.com/segmentio/kafka-go"
writer := kafka.NewWriter(kafka.WriterConfig{
Brokers: []string{"localhost:9092"},
Topic: "graph-mutations",
})
// Write vertex update event
writer.WriteMessages(ctx, kafka.Message{
Key: []byte("vertex:123"),
Value: vertexUpdateJSON,
})
Assessment:
- ✅ Excellent Go libraries (segmentio/kafka-go, IBM/sarama)
- ✅ Good documentation
- ⚠️ API complexity for distributed systems newcomers
Data Model Fit (15/30) ⚠️
Event Sourcing Model:
// Event stream (not direct vertex storage)
type GraphEvent struct {
EventType string // "VertexCreated", "EdgeAdded", "PropertyUpdated"
VertexID string
Payload json.RawMessage
Timestamp time.Time
}
// Consumers rebuild graph state
// Topic: graph-mutations
// Partition key: VertexID (ensures order per vertex)
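A hedged sketch of the materialization side, assuming kafka-go and an in-memory view keyed by vertex ID (group ID is illustrative; a real consumer would write into the hot/cold tiers instead):
// Rebuild graph state by replaying mutation events.
reader := kafka.NewReader(kafka.ReaderConfig{
    Brokers: []string{"localhost:9092"},
    Topic:   "graph-mutations",
    GroupID: "graph-materializer", // illustrative consumer group
})
defer reader.Close()
view := map[string]json.RawMessage{} // vertexID -> latest payload
for {
    msg, err := reader.ReadMessage(ctx)
    if err != nil {
        break // context cancelled or fatal error
    }
    var ev GraphEvent
    if err := json.Unmarshal(msg.Value, &ev); err != nil {
        continue // skip malformed events
    }
    switch ev.EventType {
    case "VertexCreated", "PropertyUpdated":
        view[ev.VertexID] = ev.Payload
    case "EdgeAdded":
        // apply to the adjacency structure of the materialized store
    }
}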
Assessment:
- ⚠️ Not a storage backend - event streaming only
- ⚠️ Requires separate storage for materialized views
- ✅ Good for change data capture (CDC)
- ✅ Enables time-travel queries (replay events)
Testing Difficulty (14/20) ⚠️
Local Testing:
# Zookeeper first (Kafka depends on it; KRaft mode removes this dependency)
podman run -d --name zookeeper \
  -p 2181:2181 \
  confluentinc/cp-zookeeper
# Kafka broker (shared network so "zookeeper" resolves; additional Confluent env vars usually needed)
podman run -d --name kafka \
  -p 9092:9092 \
  -e KAFKA_ZOOKEEPER_CONNECT=zookeeper:2181 \
  confluentinc/cp-kafka
Assessment:
- ⚠️ Requires Zookeeper alongside Kafka (or a single-container KRaft setup)
- ⚠️ Slower startup (~10-15 seconds)
- ⚠️ Topic management in tests adds complexity
Operational Complexity (16/20) ⚠️
Deployment:
- ✅ Mature Kubernetes operators (Strimzi)
- ✅ Excellent horizontal scaling
- ⚠️ Requires careful partition management
Monitoring:
- ✅ JMX metrics, Prometheus exporters
- ✅ Excellent observability
- ⚠️ Many metrics to track
Scaling:
- ✅ Horizontal scaling via partitions
- ✅ High throughput (millions of events/sec)
- ⚠️ Rebalancing can cause lag spikes
Assessment: Powerful for event streaming, overkill for simple graph storage.
Overall Assessment
Strengths:
- ✅ Excellent for change data capture
- ✅ Enables event sourcing patterns
- ✅ High throughput
Weaknesses:
- ❌ Not a storage backend (requires materialized views)
- ⚠️ Operational complexity
- ⚠️ Testing complexity (multiple dependencies)
Use Case: ⚠️ Specialized - event sourcing and CDC only
Rank #7: NATS (Score: 65/100) ⚠️
Overview
Type: Messaging system
Best For: Real-time pub/sub, distributed cache invalidation
Used In: Potential for cache invalidation notifications
Go SDK Quality (27/30) ✅
// Official: github.com/nats-io/nats.go
import "github.com/nats-io/nats.go"
nc, _ := nats.Connect("nats://localhost:4222")
// Publish cache invalidation
nc.Publish("cache.invalidate.vertex.123", []byte("invalidate"))
// Subscribe to invalidations (">" matches all remaining tokens, e.g. "vertex.123")
nc.Subscribe("cache.invalidate.>", func(msg *nats.Msg) {
// Handle invalidation
})
Assessment:
- ✅ Excellent official Go SDK
- ✅ Simple, idiomatic API
- ✅ Strong typing
- ✅ Native Go (no CGo)
Data Model Fit (12/30) ❌
Messaging Only:
// NATS is NOT a storage backend
// Use case: Notify proxies of vertex updates
type InvalidationMessage struct {
VertexID string
PartitionID int
Timestamp time.Time
}
Assessment:
- ❌ Not a storage backend - messaging only
- ❌ No persistence (JetStream adds persistence but still not a database)
- ✅ Good for cache invalidation notifications
Testing Difficulty (18/20) ✅
Local Testing:
# NATS server (single binary)
podman run -d --name nats -p 4222:4222 nats:latest
Assessment:
- ✅ Single binary, fast startup (<1 second)
- ✅ Simple testcontainers-go integration
- ✅ Embedded mode for tests (nats-server package)
Operational Complexity (8/20) ❌
Deployment:
- ✅ Simple deployment
- ✅ Kubernetes operators available
- ❌ Not a storage system
Monitoring:
- ✅ Prometheus exporter
- ✅ Built-in metrics endpoint
Scaling:
- ✅ Horizontal scaling (clustering)
- ❌ Not applicable for storage
Assessment: Excellent for messaging, not a storage backend.
Overall Assessment
Strengths:
- ✅ Excellent Go SDK
- ✅ Fast, lightweight
- ✅ Perfect for cache invalidation
Weaknesses:
- ❌ Not a storage backend
- ❌ Limited use case for graph storage
Use Case: ⚠️ Specialized - cache invalidation only
Rank #8: Neptune (Score: 50/100) ❌
Overview
Type: Native graph database (AWS managed)
Best For: AWS-native graph applications
Used In: Potential alternative to custom graph implementation
Go SDK Quality (10/30) ❌
Problem: No official Go SDK for Gremlin
// Must use HTTP/WebSocket client
import "github.com/go-gremlin/gremlin"
// Unofficial, limited support
client, _ := gremlin.NewClient("ws://neptune-endpoint:8182/gremlin")
query := "g.V('123').outE('friend').inV()"
results, _ := client.Execute(query)
Assessment:
- ❌ No official Go SDK
- ❌ HTTP/WebSocket-based (not native protocol)
- ⚠️ Unofficial libraries with limited support
- ⚠️ Gremlin query strings (no type safety)
Data Model Fit (30/30) ✅
Native Graph Model:
// Gremlin queries (native graph traversals)
// Add vertex
g.addV('user').property('id', '123').property('name', 'Alice')
// Add edge
g.V('123').addE('friend').to(V('456'))
// Traverse
g.V('123').outE('friend').inV().values('name')
// Complex traversal (2-hop)
g.V('123').out('friend').out('friend').dedup()
Assessment:
- ✅ Best data model for graphs
- ✅ Native graph traversals
- ✅ Graph algorithms (PageRank, shortest path)
- ✅ No impedance mismatch
Testing Difficulty (5/20) ❌
Problem: No local Neptune
# No Docker/Podman image
# Must use:
# 1. AWS Neptune (expensive for CI/CD)
# 2. TinkerGraph (in-memory, different semantics)
# 3. JanusGraph (different backend, setup complex)
Assessment:
- ❌ No local testing option
- ❌ Expensive to use real Neptune for CI/CD
- ❌ Alternatives (TinkerGraph) have different behavior
- ❌ Slow test feedback loop
Operational Complexity (5/20) ❌
Deployment:
- ❌ AWS-only (vendor lock-in)
- ❌ Cannot self-host
- ⚠️ Limited region availability
- ⚠️ Expensive ($0.58/hour for smallest instance)
Monitoring:
- ✅ CloudWatch metrics
- ⚠️ Limited visibility into query execution
Scaling:
- ✅ Read replicas
- ⚠️ Vertical scaling only (instance size)
- ❌ No horizontal sharding
Assessment: AWS-only is major limitation for local development and multi-cloud.
Overall Assessment
Strengths:
- ✅ Best native graph model
- ✅ Built-in graph algorithms
- ✅ Fully managed (AWS)
Weaknesses:
- ❌ No official Go SDK
- ❌ No local testing
- ❌ AWS vendor lock-in
- ❌ Expensive
- ❌ Cannot self-host
Use Case: ❌ Not recommended due to lack of Go SDK and local testing
Alternative: Building a custom graph layer on top of Redis + PostgreSQL + S3 (as validated by RFC-057 and RFC-059) provides better control and testability.
Recommendations
Primary Recommendation: Hybrid Approach ✅
Architecture (from RFC-059):
┌─────────────────────────────────────────────────────┐
│ Hot Tier (10%): Redis │
│ - 10B most-accessed vertices │
│ - Sub-millisecond latency │
│ - 21 TB RAM across 1000 nodes │
│ - Cost: $587k/month │
└─────────────────────────────────────────────────────┘
│
│ Temperature-based eviction
↓
┌─────────────────────────────────────────────────────┐
│ Cold Tier (90%): S3/MinIO │
│ - 90B cold vertices │
│ - 50-200ms latency (parallel load) │
│ - Parquet snapshots │
│ - Cost: $4.3k/month │
└─────────────────────────────────────────────────────┘
│
│ Metadata queries
↓
┌─────────────────────────────────────────────────────┐
│ Metadata: PostgreSQL │
│ - Partition metadata │
│ - Index structures (RFC-058) │
│ - Configuration │
│ - Cost: $500/month │
└─────────────────────────────────────────────────────┘
Rationale:
- ✅ Redis for hot tier (10%) - validated by RFC-059 (95% cost reduction)
- ✅ S3 for cold tier (90%) - 60-second recovery time
- ✅ PostgreSQL for metadata - JSONB indexes
- ✅ All three have excellent Go SDKs and local testing
- ✅ Total cost: ~$592k/month vs $105M/month (all in-memory)
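A minimal sketch of one way the temperature signal behind hot/cold placement could be tracked, using a Redis sorted set of access counts (key name and mechanism are assumptions; RFC-059 may specify a different policy):
// RecordAccess bumps a vertex's temperature on every read.
func RecordAccess(ctx context.Context, client *redis.Client, vertexID string) error {
    return client.ZIncrBy(ctx, "temperature", 1, vertexID).Err()
}
// ColdestCandidates returns the n least-accessed vertices, i.e. candidates
// to demote from the Redis hot tier to S3 snapshots.
func ColdestCandidates(ctx context.Context, client *redis.Client, n int64) ([]string, error) {
    return client.ZRange(ctx, "temperature", 0, n-1).Result()
}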
Implementation Phases
Phase 1: Core Storage (Weeks 13-14)
Components:
- Redis hot tier driver
- S3/MinIO cold tier driver
- PostgreSQL metadata driver
Go Packages:
pkg/storage/
├── interface.go // Storage interface
├── redis/
│ ├── driver.go // Redis hot tier
│ └── driver_test.go
├── s3/
│ ├── driver.go // S3 cold tier
│ └── driver_test.go
└── postgres/
├── driver.go // PostgreSQL metadata
└── driver_test.go
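A hedged sketch of what pkg/storage/interface.go could define, reusing the illustrative Vertex/Edge types from the Methodology section; the method set and tiering logic are assumptions, not the final contract:
// Storage is the minimal contract each tier driver would implement.
type Storage interface {
    GetVertex(ctx context.Context, id string) (*Vertex, error)
    PutVertex(ctx context.Context, v *Vertex) error
    GetEdges(ctx context.Context, srcID, label string) ([]Edge, error)
}
// TieredStore composes the hot and cold tiers behind one interface.
type TieredStore struct {
    Hot  Storage // Redis driver
    Cold Storage // S3/MinIO snapshot driver
}
// GetVertex serves from the hot tier when possible and falls back to the
// cold tier on a miss; promotion back into Redis is left to the eviction
// policy sketched above.
func (t *TieredStore) GetVertex(ctx context.Context, id string) (*Vertex, error) {
    if v, err := t.Hot.GetVertex(ctx, id); err == nil && v != nil {
        return v, nil
    }
    return t.Cold.GetVertex(ctx, id)
}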
Phase 2: Testing Infrastructure (Week 14)
Local Stack:
# docker-compose.yml or Podman equivalent
services:
redis:
image: redis:7-alpine
ports: ["6379:6379"]
minio:
image: minio/minio
command: server /data --console-address ":9001"
ports: ["9000:9000", "9001:9001"]
postgres:
image: postgres:16-alpine
environment:
POSTGRES_PASSWORD: secret
ports: ["5432:5432"]
Test Helpers:
// pkg/testing/storage.go
func NewTestStorageBackends(t *testing.T) (*redis.Client, *s3.Client, *pgxpool.Pool) {
// Use testcontainers-go to spin up all three
// Return clients ready for testing
}
Phase 3: Specialized Backends (Week 15-16)
Optional additions based on workload:
- ClickHouse for audit logs (RFC-061)
- Kafka for event sourcing
- NATS for cache invalidation
Cost Analysis
Monthly Operational Costs (100B Vertices)
| Backend | Use Case | Nodes | Cost/Month | % of Total |
|---|---|---|---|---|
| Redis | Hot tier (10%) | 1000 × 32 GB | $587,347 | 99.0% |
| S3 | Cold tier (90%) | 189 TB | $4,347 | 0.7% |
| PostgreSQL | Metadata | 3 replicas | $500 | 0.1% |
| ClickHouse | Audit logs | 10 nodes | $1,000 | 0.2% |
| Total | | | $593,194 | 100% |
Cost Breakdown:
- Hot tier dominates costs (99%)
- Cold tier is negligible (0.7%)
- Total is 0.56% of all-in-memory cost ($105M/month)
Savings: 99.44% reduction ($105M/month → $593k/month, roughly $104.4M/month saved)
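Sanity check: $587,347 + $4,347 + $500 + $1,000 = $593,194/month, and $593,194 / $105,000,000 ≈ 0.56%, matching the figures above.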
Next Steps
Week 14: Performance Benchmarking
Focus: Validate storage backend performance under load
Tasks:
- Benchmark Redis hot tier (latency, throughput)
- Benchmark S3 cold tier (parallel load performance)
- Benchmark PostgreSQL metadata queries
- Measure temperature-based eviction performance
- Validate 60-second recovery time (RFC-059 claim)
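A minimal Go benchmark sketch for the first task, measuring hot-tier read latency against the key layout from the Redis section (address and seed key are assumptions):
func BenchmarkRedisVertexGet(b *testing.B) {
    ctx := context.Background()
    client := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
    defer client.Close()
    // Seed one representative vertex so the benchmark measures a cache hit.
    if err := client.HSet(ctx, "vertex:user:123", "name", "Alice").Err(); err != nil {
        b.Fatal(err)
    }
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        if err := client.HGetAll(ctx, "vertex:user:123").Err(); err != nil {
            b.Fatal(err)
        }
    }
}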
Week 15: Disaster Recovery and Data Lifecycle
Focus: Backup, restore, replication strategies
Tasks:
- Redis persistence (RDB vs AOF trade-offs)
- S3 versioning and lifecycle policies
- PostgreSQL streaming replication
- Cross-region disaster recovery
- RPO/RTO validation
Week 16: Comprehensive Cost Analysis
Focus: Detailed cost modeling and optimization
Tasks:
- Detailed AWS/GCP/Azure pricing comparison
- Request cost analysis (S3 GET/PUT costs)
- Network egress costs
- Reserved instance vs on-demand savings
- Cost optimization recommendations
Appendices
Appendix A: Backend Scoring Rubric
Go SDK Quality (30 points):
- Official SDK: +15 points
- Good documentation: +5 points
- Active community: +5 points
- Idiomatic Go: +5 points
Data Model Fit (30 points):
- Native graph support: +30 points
- Relational with JSON: +25 points
- Key-value: +20 points
- Event streaming: +15 points
- Object storage: +10 points
Testing Difficulty (20 points):
- In-memory/embedded: +20 points
- Single container: +18 points
- Multiple containers: +14 points
- External service only: +5 points
Operational Complexity (20 points):
- Mature tooling: +10 points
- Easy deployment: +5 points
- Good monitoring: +3 points
- Horizontal scaling: +2 points
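Worked example (PostgreSQL): 30 (Go SDK) + 25 (relational with JSON) + 20 (testing) + 15 (operational) = 90/100, matching the detailed evaluation above.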
Appendix B: Data Model Comparison
| Backend | Vertices | Edges | Properties | Traversals |
|---|---|---|---|---|
| Neptune | Native | Native | Native | Native ✅ |
| Redis | Hash/JSON | Sorted Sets | Hash/JSON | Application |
| PostgreSQL | Table + JSONB | Table | JSONB | CTE (3 hops) |
| SQLite | Table + JSON | Table | JSON | CTE (3 hops) |
| S3/MinIO | Parquet | Parquet | Columns | Bulk only |
| ClickHouse | Table | Table | Columns | Analytics only |
| Kafka | N/A | N/A | N/A | N/A |
| NATS | N/A | N/A | N/A | N/A |
Appendix C: Testing Comparison
| Backend | Startup Time | Dependencies | Testcontainers | CI/CD Friendly |
|---|---|---|---|---|
| SQLite | <1ms | None | N/A | ✅ Excellent |
| Redis | ~1s | None | ✅ Yes | ✅ Excellent |
| PostgreSQL | ~3s | None | ✅ Yes | ✅ Good |
| NATS | ~1s | None | ✅ Yes | ✅ Excellent |
| MinIO | ~2s | None | ✅ Yes | ✅ Good |
| ClickHouse | ~10s | None | ✅ Yes | ⚠️ Moderate |
| Kafka | ~15s | Zookeeper | ✅ Yes | ⚠️ Moderate |
| Neptune | N/A | AWS Account | ❌ No | ❌ Poor |