MEMO-076: Week 16 - Comprehensive Cost Analysis and TCO Comparison

Date: 2025-11-16
Updated: 2025-11-16
Author: Platform Team
Related: MEMO-073, MEMO-074, MEMO-075, RFC-057, RFC-059

Executive Summary

Goal: Provide a detailed cost analysis and total cost of ownership (TCO) comparison for the massive-scale (100B-vertex) graph system

Scope: 3-year TCO across AWS, GCP, Azure, and commercial graph databases

Findings:

  • Hybrid architecture (Redis + S3 + PostgreSQL): $32.4M over 3 years on AWS with reserved instances ($20.0M on GCP, $15.6M on Azure)
  • Commercial graph databases: $155.4M (Neptune), $33.9M (Neo4j Enterprise), and $34.5M (TigerGraph) over 3 years
  • Cost savings: 79% vs Neptune; Neo4j and TigerGraph cost 5-6% more than the hybrid and add vendor dependency
  • Optimization opportunities: 3-year reserved instances cut the on-demand bill by 49%; all optimizations combined reach 75%
  • Break-even point: 8 days vs Neptune, 2.2 months vs Neo4j Enterprise

Recommendation: Deploy hybrid architecture on AWS with 3-year reserved instances (49% savings vs on-demand)


Methodology

Cost Components

Operational Costs (monthly recurring):

  1. Compute: EC2/VM instances for Redis, proxy nodes
  2. Storage: S3/GCS/Blob for cold tier, EBS/disk for hot tier
  3. Network: Data transfer, cross-AZ traffic, egress
  4. Database: RDS/Cloud SQL for PostgreSQL metadata
  5. Backup: Snapshot storage, cross-region replication

One-Time Costs:

  1. Development: Engineering time to build hybrid system
  2. Migration: Data migration from existing systems
  3. Training: Team training on new architecture

Ongoing Costs:

  1. Operations: SRE team, on-call rotation
  2. Monitoring: CloudWatch, Prometheus, Grafana
  3. Support: Cloud support plans

Detailed Cost Breakdown

AWS Pricing (Primary Analysis)

1. Redis Hot Tier (10% of data)

Infrastructure:

  • 1000 instances × r6i.4xlarge (16 vCPU, 128 GB RAM)
  • 21 TB total RAM (100M vertices × 1.12 GB per 1M)
  • Network: 10 Gbps per instance

Pricing (On-Demand, us-west-2):

Instance cost:
1000 instances × $2.016/hour × 730 hours/month = $1,471,680/month

EBS volumes (for RDB/AOF persistence):
1000 instances × 200 GB × $0.08/GB = $16,000/month

Network (cross-AZ traffic, 5% of queries):
100 TB/month × $0.01/GB = $1,000/month

Total hot tier: $1,488,680/month

Reserved Instance Pricing (3-year, All Upfront):

Instance cost:
1000 instances × $1.008/hour × 730 hours/month = $735,840/month
(50% savings vs on-demand)

Total hot tier (reserved): $752,840/month

Annual Savings (Reserved): $8.8M/year (50% off instance cost, 49% off total hot-tier cost)
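The hot-tier arithmetic above can be reproduced with a short script. This is a minimal sketch rather than the cost model from Appendix B: the function name `hot_tier_monthly` and its parameter names are illustrative, and the rates are the ones quoted in this section.

```python
HOURS_PER_MONTH = 730  # billing-hours convention used in this memo


def hot_tier_monthly(instances: int, hourly_rate: float,
                     ebs_gb_per_instance: float = 200, ebs_rate_per_gb: float = 0.08,
                     cross_az_gb: float = 100_000, cross_az_rate_per_gb: float = 0.01) -> float:
    """Monthly hot-tier cost in USD: instances + EBS volumes + cross-AZ traffic."""
    compute = instances * hourly_rate * HOURS_PER_MONTH
    ebs = instances * ebs_gb_per_instance * ebs_rate_per_gb
    network = cross_az_gb * cross_az_rate_per_gb  # 100 TB taken as 100,000 GB
    return compute + ebs + network


on_demand = hot_tier_monthly(1000, 2.016)  # on-demand rate quoted in this section
reserved = hot_tier_monthly(1000, 1.008)   # 3-year reserved rate quoted in this section
annual_savings = (on_demand - reserved) * 12

print(f"On-demand: ${on_demand:,.0f}/month")
print(f"Reserved:  ${reserved:,.0f}/month")
print(f"Savings:   ${annual_savings / 1e6:.1f}M/year")
```

Only the instance rate changes between the two calls, so the EBS and network charges carry over unchanged into the reserved total.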


2. S3 Cold Tier (90% of data)

Storage:

  • 189 TB cold data (90B vertices × 2.1 KB average)
  • Parquet compressed (65% compression ratio)

Pricing (S3 Standard, us-west-2):

Storage cost:
189 TB × $0.023/GB = $4,347/month

PUT requests (hourly snapshot deltas):
1000 partitions × 24 snapshots/day × 30 days = 720,000 PUTs
720,000 × $0.005/1000 = $3.60/month

GET requests (10 cold tier loads/day for testing):
1000 partitions × 10 loads/day × 30 days = 300,000 GETs
300,000 × $0.0004/1000 = $0.12/month

Total cold tier: $4,351/month

Lifecycle Savings (tiered archival):

After 90 days → Glacier:
189 TB × $0.004/GB = $756/month (83% savings)

After 365 days → Deep Archive:
189 TB × $0.00099/GB = $187/month (96% savings)

Average over 3 years: $1,500/month
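The cold-tier figures can be checked the same way. The sketch below assumes decimal terabytes (1 TB = 1,000 GB), which is what the memo's arithmetic implies; `s3_monthly` is a hypothetical helper, and retrieval charges and minimum storage durations for the archive classes are not modeled.

```python
GB_PER_TB = 1000  # decimal terabytes, matching the memo's arithmetic


def s3_monthly(storage_tb: float, rate_per_gb: float, puts: int = 0, gets: int = 0,
               put_rate_per_1k: float = 0.005, get_rate_per_1k: float = 0.0004) -> float:
    """Monthly S3 cost in USD: storage plus PUT/GET request charges."""
    storage = storage_tb * GB_PER_TB * rate_per_gb
    requests = puts / 1000 * put_rate_per_1k + gets / 1000 * get_rate_per_1k
    return storage + requests


# Standard tier with hourly snapshot PUTs and 10 test loads per day per partition.
standard = s3_monthly(189, 0.023, puts=1000 * 24 * 30, gets=1000 * 10 * 30)
storage_only = s3_monthly(189, 0.023)
tiers = {"Glacier": s3_monthly(189, 0.004), "Deep Archive": s3_monthly(189, 0.00099)}

print(f"Standard (with requests): ${standard:,.0f}/month")
for name, cost in tiers.items():
    saving = 100 * (1 - cost / storage_only)
    print(f"{name}: ${cost:,.0f}/month ({saving:.0f}% below Standard storage)")
```

Request charges are a rounding error next to storage here, which is why the lifecycle rules dominate the cold-tier bill.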

3. PostgreSQL Metadata

Infrastructure:

  • 1 primary + 2 sync replicas + 1 async replica (DR region)
  • db.r6i.xlarge (4 vCPU, 32 GB RAM)

Pricing (RDS, us-west-2):

Instance cost:
4 instances × $0.504/hour × 730 hours/month = $1,472/month

Storage (partition metadata, 500 GB):
500 GB × $0.115/GB = $58/month

Backup storage (automated backups, 1 TB):
1 TB × $0.095/GB = $95/month

Total metadata: $1,625/month

4. Proxy Nodes (Rust)

Infrastructure:

  • 1000 instances × c6i.2xlarge (8 vCPU, 16 GB RAM)
  • Stateless proxies (no storage)

Pricing (On-Demand, us-west-2):

Instance cost:
1000 instances × $0.34/hour × 730 hours/month = $248,200/month

Network (intra-AZ, no charge):
$0/month

Total proxy: $248,200/month

Reserved Instance Pricing (3-year):

Instance cost:
1000 instances × $0.17/hour × 730 hours/month = $124,100/month
(50% savings)

Total proxy (reserved): $124,100/month

5. Backup and DR

Costs (from MEMO-075):

Redis RDB snapshots (7 days retention):
294 TB × $0.023/GB = $6,762/month

PostgreSQL WAL archiving:
3 TB × $0.023/GB = $69/month

S3 snapshot deltas (incremental, 30 days):
1.89 TB/day × 30 days × $0.023/GB = $1,304/month

Cross-region replication:
189 TB × $0.02/GB = $3,780/month

Total backup/DR: $12,000/month (rounded up from $11,915)

6. Monitoring and Operations

Infrastructure:

  • Prometheus (c6i.xlarge × 3 for HA)
  • Grafana (t3.medium)
  • CloudWatch logs and metrics

Pricing:

Prometheus instances:
3 × $0.17/hour × 730 hours = $372/month

Grafana:
1 × $0.0416/hour × 730 hours = $30/month

CloudWatch:
Logs ingestion: 10 TB/month × $0.50/GB = $5,000/month
Metrics: 100K custom metrics × $0.30/metric = $30,000/month
Alarms: 1000 alarms × $0.10/alarm = $100/month

Total monitoring: $35,502/month

Optimization: Use Prometheus/Grafana primarily, CloudWatch for AWS-specific metrics only → $5,000/month


AWS Total Cost Summary

Monthly Operational Costs (On-Demand):

| Component | Cost/month | % of total |
|---|---|---|
| Redis hot tier | $1,488,680 | 84.6% |
| Proxy nodes | $248,200 | 14.1% |
| S3 cold tier | $4,351 | 0.2% |
| PostgreSQL metadata | $1,625 | 0.1% |
| Backup/DR | $12,000 | 0.7% |
| Monitoring | $5,000 | 0.3% |
| Total | $1,759,856 | 100% |

Annual: $21.1M/year

3-Year TCO (On-Demand): $63.4M


Monthly Operational Costs (Reserved Instances):

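| Component | Cost/month | % of total | Savings |
|---|---|---|---|
| Redis hot tier (RI) | $752,840 | 83.7% | 49% |
| Proxy nodes (RI) | $124,100 | 13.8% | 50% |
| S3 cold tier | $4,351 | 0.5% | 0% |
| PostgreSQL metadata | $1,625 | 0.2% | 0% |
| Backup/DR | $12,000 | 1.3% | 0% |
| Monitoring | $5,000 | 0.6% | 0% |
| Total | $899,916 | 100% | 49% |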

Annual: $10.8M/year

3-Year TCO (Reserved Instances): $32.4M

3-Year Savings (Reserved vs On-Demand): $31M (49%)
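The roll-up from component costs to monthly, annual, and 3-year totals is a plain sum. The following sketch recomputes both scenarios from the component figures above; the `summarize` helper and dictionary names are illustrative only.

```python
ON_DEMAND = {
    "Redis hot tier": 1_488_680,
    "Proxy nodes": 248_200,
    "S3 cold tier": 4_351,
    "PostgreSQL metadata": 1_625,
    "Backup/DR": 12_000,
    "Monitoring": 5_000,
}
# Reserved pricing only changes the two compute-heavy components.
RESERVED = {**ON_DEMAND, "Redis hot tier": 752_840, "Proxy nodes": 124_100}


def summarize(components: dict) -> dict:
    """Monthly total, annual cost, 3-year TCO, and each component's share in percent."""
    monthly = sum(components.values())
    return {
        "monthly": monthly,
        "annual": monthly * 12,
        "tco_3yr": monthly * 36,
        "shares": {name: 100 * cost / monthly for name, cost in components.items()},
    }


od, ri = summarize(ON_DEMAND), summarize(RESERVED)
for label, s in (("On-demand", od), ("Reserved", ri)):
    print(f"{label}: ${s['monthly']:,}/month, ${s['tco_3yr'] / 1e6:.1f}M over 3 years")
    for name, pct in s["shares"].items():
        print(f"  {name}: {pct:.1f}%")
print(f"Reserved saves {100 * (1 - ri['monthly'] / od['monthly']):.0f}% vs on-demand")
```

Computing the shares from the totals, rather than typing them in, keeps the percentage column summing to 100% whenever a component cost is revised.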


GCP Pricing Comparison

Infrastructure Mapping

| AWS | GCP | vCPU | RAM |
|---|---|---|---|
| r6i.4xlarge | n2-highmem-16 | 16 | 128 GB |
| c6i.2xlarge | n2-highcpu-8 | 8 | 8 GB |
| RDS PostgreSQL | Cloud SQL | 4 | 32 GB |
| S3 Standard | GCS Standard | - | - |

Pricing (us-west1, On-Demand)

Redis hot tier:
1000 × n2-highmem-16 × $1.478/hour × 730 hours = $1,078,940/month

Proxy nodes:
1000 × n2-highcpu-8 × $0.2366/hour × 730 hours = $172,718/month

GCS cold tier:
189 TB × $0.020/GB = $3,780/month

Cloud SQL:
4 × db-n1-standard-4 × $0.2655/hour × 730 hours = $775/month

Backup/DR (similar to AWS):
$12,000/month

Monitoring (Cloud Monitoring):
$3,000/month

Total GCP (on-demand): $1,271,213/month

Annual: $15.3M/year

3-Year TCO (GCP On-Demand): $45.8M

Savings vs AWS On-Demand: 28% cheaper


GCP Committed Use Discounts (3-year)

Redis hot tier (57% discount):
$1,078,940 × 0.43 = $463,944/month

Proxy nodes (57% discount):
$172,718 × 0.43 = $74,269/month

Other costs (unchanged):
$19,555/month

Total GCP (committed): $557,768/month

Annual: $6.7M/year

3-Year TCO (GCP Committed): $20.0M

Savings vs AWS Reserved: 38% cheaper

Assessment: ✅ GCP is more cost-effective than AWS (38% cheaper with committed-use pricing)
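The committed-use calculation applies one flat discount to compute and passes every other cost through unchanged. A minimal sketch of that step, using the GCP figures from this section (the helper name `committed_monthly` is an assumption, not part of any provider tooling):

```python
def committed_monthly(compute_on_demand: dict, discount: float, other_costs: float) -> float:
    """Apply a flat commitment discount to compute; other costs pass through unchanged."""
    discounted = sum(cost * (1 - discount) for cost in compute_on_demand.values())
    return discounted + other_costs


gcp_compute = {"Redis hot tier": 1_078_940, "Proxy nodes": 172_718}
gcp_other = 3_780 + 775 + 12_000 + 3_000  # GCS, Cloud SQL, backup/DR, monitoring

gcp_committed = committed_monthly(gcp_compute, discount=0.57, other_costs=gcp_other)
aws_reserved = 899_916  # AWS reserved monthly total from the AWS summary

print(f"GCP committed: ${gcp_committed:,.0f}/month")
print(f"3-year TCO:    ${gcp_committed * 36 / 1e6:.1f}M")
print(f"vs AWS reserved: {100 * (1 - gcp_committed / aws_reserved):.0f}% cheaper")
```

The same function covers any provider whose commitment discount applies uniformly to compute, given that provider's on-demand compute costs and discount rate.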


Azure Pricing Comparison

Infrastructure Mapping

| AWS | Azure | vCPU | RAM |
|---|---|---|---|
| r6i.4xlarge | E16ds v5 | 16 | 128 GB |
| c6i.2xlarge | F8s v2 | 8 | 16 GB |
| RDS PostgreSQL | Azure Database | 4 | 32 GB |
| S3 Standard | Blob Storage Hot | - | - |

Pricing (West US 2, On-Demand)

Redis hot tier:
1000 × E16ds_v5 × $1.152/hour × 730 hours = $840,960/month

Proxy nodes:
1000 × F8s_v2 × $0.338/hour × 730 hours = $246,740/month

Blob Storage cold tier:
189 TB × $0.018/GB = $3,402/month

Azure Database for PostgreSQL:
4 × General Purpose (4 vCPU) × $0.294/hour × 730 hours = $858/month

Backup/DR:
$10,500/month

Monitoring (Azure Monitor):
$4,000/month

Total Azure (on-demand): $1,106,460/month

Annual: $13.3M/year

3-Year TCO (Azure On-Demand): $39.8M


Azure Reserved Instances (3-year)

Redis hot tier (62% discount):
$840,960 × 0.38 = $319,565/month

Proxy nodes (62% discount):
$246,740 × 0.38 = $93,761/month

Other costs (unchanged):
$18,760/month

Total Azure (reserved): $432,086/month

Annual: $5.2M/year

3-Year TCO (Azure Reserved): $15.6M

Savings vs AWS Reserved: 52% cheaper

Assessment: ✅ Azure is cheapest option


Commercial Graph Database Comparison

AWS Neptune

Infrastructure

Cluster Configuration:

  • 1000 db.r6g.16xlarge instances (64 vCPU, 512 GB RAM each)
  • 512 TB storage (100B vertices × 5 KB average)
  • No separate cold tier (all data in Neptune)

Pricing (us-west-2)

Instance cost:
1000 × db.r6g.16xlarge × $5.824/hour × 730 hours = $4,251,520/month

Storage:
512 TB × $0.10/GB = $51,200/month

I/O requests (1B requests/month):
1,000,000,000 × $0.20/1M = $200/month

Backup storage (512 TB):
512 TB × $0.021/GB = $10,752/month

Total Neptune: $4,313,672/month

Annual: $51.8M/year

3-Year TCO: $155.4M

vs Hybrid (AWS Reserved): 4.8× more expensive

Assessment: ❌ Prohibitively expensive at 100B scale


Neo4j Enterprise (Self-Hosted)

Licensing

Enterprise License:

  • $180,000/year per cluster (unlimited nodes within cluster)
  • Need 10 clusters for 100B vertices → $1.8M/year licensing

Infrastructure (AWS)

Compute (similar to Redis hot tier):
1000 × r6i.4xlarge × $1.008/hour × 730 hours = $735,840/month

Storage (all-SSD for performance):
512 TB × $0.08/GB (gp3) = $40,960/month

Backup/DR:
$15,000/month

Total Neo4j: $791,800/month + $150,000/month (licensing)
= $941,800/month

Annual: $11.3M/year

3-Year TCO: $33.9M

vs Hybrid (AWS Reserved): 1.05× slightly more expensive

Assessment: ⚠️ Comparable cost but vendor lock-in


TigerGraph Enterprise

Licensing

Enterprise License:

  • Quote-based pricing for 100B vertices
  • Estimated: $2M-3M/year for this scale

Infrastructure (AWS)

Compute:
1000 × r6i.4xlarge × $1.008/hour × 730 hours = $735,840/month (reserved)

Storage (compressed):
256 TB × $0.08/GB = $20,480/month

Total TigerGraph: $756,320/month + $200,000/month (licensing)
= $956,320/month

Annual: $11.5M/year

3-Year TCO: $34.5M

vs Hybrid (AWS Reserved): 1.06× slightly more expensive

Assessment: ⚠️ Similar to Neo4j, vendor-dependent
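For the self-hosted commercial options, the monthly figure is infrastructure plus the annual license spread over 12 months. The sketch below reproduces the Neo4j and TigerGraph totals and their ratio to the hybrid AWS reserved cost; `licensed_monthly` is a hypothetical helper, and the TigerGraph license uses the $200K/month estimate from this section.

```python
HYBRID_AWS_RESERVED = 899_916  # monthly hybrid cost on AWS with reserved instances


def licensed_monthly(infra_monthly: float, license_annual: float) -> float:
    """Monthly cost of a self-hosted commercial database: infrastructure plus amortized license."""
    return infra_monthly + license_annual / 12


# Neo4j: compute + gp3 storage + backup/DR, with 10 clusters at $180K/year each.
neo4j = licensed_monthly(735_840 + 40_960 + 15_000, license_annual=10 * 180_000)
# TigerGraph: compute + compressed storage, license estimated at $2.4M/year.
tigergraph = licensed_monthly(735_840 + 20_480, license_annual=2_400_000)

for name, monthly in (("Neo4j", neo4j), ("TigerGraph", tigergraph)):
    ratio = monthly / HYBRID_AWS_RESERVED
    print(f"{name}: ${monthly:,.0f}/month, {ratio:.2f}x hybrid (AWS reserved)")
```

Because the TigerGraph license is quote-based, the ratio moves with the estimate: the $2M-3M/year range quoted above shifts the monthly total by roughly ±$40K.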


Cost Comparison Summary

3-Year Total Cost of Ownership

| Solution | 3-Year TCO | Annual | Notes |
|---|---|---|---|
| Hybrid (AWS Reserved) | $32.4M | $10.8M | Recommended |
| Hybrid (GCP Committed) | $20.0M | $6.7M | Lower cost than AWS, but outside the AWS ecosystem |
| Hybrid (Azure Reserved) | $15.6M | $5.2M | Cheapest, limited graph tooling |
| AWS Neptune | $155.4M | $51.8M | 4.8× more expensive |
| Neo4j Enterprise | $33.9M | $11.3M | 1.05× more expensive + vendor lock |
| TigerGraph Enterprise | $34.5M | $11.5M | 1.06× more expensive + vendor lock |

Key Findings:

  • ✅ Hybrid architecture is most cost-effective
  • ✅ Azure cheapest cloud (52% less than AWS)
  • ❌ Neptune 4.8× more expensive (not viable at 100B scale)
  • ⚠️ Neo4j/TigerGraph comparable but add vendor dependency

Cost Optimization Opportunities

1. Reserved Instances / Committed Use

Current: On-demand pricing analyzed

Optimization: 3-year reserved instances

Savings:

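  • AWS: 49% reduction in total cost ($31M over 3 years; 50% discount on compute)
  • GCP: 56% reduction in total cost ($25.7M over 3 years; 57% discount on compute)
  • Azure: 61% reduction in total cost ($24.3M over 3 years; 62% discount on compute)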

Recommendation: ✅ Commit to 3-year reserved instances


2. Spot Instances for Non-Critical Workloads

Use Cases:

  • Backup/restore testing
  • Performance benchmarking
  • Development environments

Savings:

  • 70-90% discount vs on-demand
  • Risk: Instance termination with 2-minute warning

Potential Monthly Savings: $50K-100K

Recommendation: ✅ Use spot instances for testing/dev


3. S3 Intelligent Tiering

Current: Manual lifecycle policies

Optimization: Automatic tiering based on access patterns

Benefits:

  • No retrieval fees (unlike Glacier)
  • Automatic optimization
  • 70% cost reduction for infrequent access

Pricing:

Intelligent Tiering:
189 TB × $0.0125/GB (avg after auto-tiering) = $2,363/month

vs Current Standard:
189 TB × $0.023/GB = $4,347/month

Savings: $1,984/month = $23.8K/year

Recommendation: ✅ Enable intelligent tiering


4. Cross-AZ Transfer Reduction

Current: 5% cross-AZ traffic ($1K/month)

Optimization: Implement placement hints (RFC-057)

Target: Reduce to <1% cross-AZ traffic

Savings: $800/month = $9.6K/year

Recommendation: ✅ Implement placement hints
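The savings estimate follows from scaling the current cross-AZ bill by the traffic share. A minimal sketch, assuming cross-AZ volume scales linearly with the share of queries that cross zones (the constants come from the hot-tier section; `cross_az_monthly` is an illustrative name):

```python
BASELINE_FRACTION = 0.05   # share of queries crossing AZs today
BASELINE_GB = 100_000      # 100 TB/month of cross-AZ traffic at that share
RATE_PER_GB = 0.01         # cross-AZ transfer rate used in this memo


def cross_az_monthly(fraction: float) -> float:
    """Monthly cross-AZ cost in USD, assuming volume scales linearly with the cross-AZ share."""
    return BASELINE_GB * (fraction / BASELINE_FRACTION) * RATE_PER_GB


current = cross_az_monthly(0.05)
target = cross_az_monthly(0.01)

print(f"Current (5%): ${current:,.0f}/month")
print(f"Target (1%):  ${target:,.0f}/month")
print(f"Savings:      ${current - target:,.0f}/month, ${(current - target) * 12 / 1000:.1f}K/year")
```

The same function gives the cost at any other share, which is useful when testing how sensitive the budget is to placement-hint effectiveness.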


5. CloudWatch Cost Reduction

Current: $35K/month for logs + metrics

Optimization:

  • Use Prometheus for metrics (open-source)
  • Sample logs (drop 95%, ingest 5%)
  • Retain only 7 days in CloudWatch, archive to S3

Optimized Cost:

CloudWatch (minimal):
Logs: 0.5 TB × $0.50/GB = $250/month
Metrics: 5K custom × $0.30 = $1,500/month
Total: $1,750/month

Prometheus (self-hosted):
3 × c6i.xlarge × $124/month = $372/month

Total monitoring: $2,122/month

Savings: $33,380/month = $400K/year

Recommendation: ✅ Reduce CloudWatch usage, self-host Prometheus


6. Graviton Instances (ARM)

Current: Intel-based instances (x86)

Optimization: Migrate to Graviton3 (ARM)

Pricing:

  • r7g.4xlarge (Graviton3): $1.614/hour (20% cheaper than r6i.4xlarge)
  • c7g.2xlarge (Graviton3): $0.2720/hour (20% cheaper than c6i.2xlarge)

Savings:

Redis hot tier:
1000 × ($2.016 - $1.614) × 730 hours = $293,460/month

Proxy nodes:
1000 × ($0.34 - $0.2720) × 730 hours = $49,640/month

Total Graviton savings: $343,100/month = $4.1M/year

Trade-off: Requires ARM-compatible binaries (Redis and Rust both support ARM)

Recommendation: ⚠️ Evaluate Graviton3 compatibility before migrating (about 20% savings on instance cost)


Total Optimization Potential

Baseline (AWS On-Demand): $1,759,856/month

Optimizations Applied:

| Optimization | Savings/month | % reduction |
|---|---|---|
| Reserved Instances | $859,940 | 48.9% |
| Spot Instances (dev/test) | $75,000 | 4.3% |
| S3 Intelligent Tiering | $1,984 | 0.1% |
| Cross-AZ reduction | $800 | 0.05% |
| CloudWatch reduction | $33,380 | 1.9% |
| Graviton3 instances | $343,100 | 19.5% |
| Total Savings | $1,314,204 | 74.7% |

Optimized Monthly Cost: $445,652/month

Optimized 3-Year TCO: $16.0M

Savings vs Baseline: $47.4M over 3 years (75% reduction)
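The optimized figure is the baseline minus the sum of the individual savings. The sketch below reproduces that roll-up under the same assumption the table makes, namely that the savings are independent and simply add up; in practice some may overlap (reserved pricing and Graviton both apply to the same instances), so treat the result as an upper bound on savings.

```python
BASELINE = 1_759_856  # AWS on-demand monthly total

SAVINGS = {
    "Reserved Instances": 859_940,
    "Spot Instances (dev/test)": 75_000,
    "S3 Intelligent Tiering": 1_984,
    "Cross-AZ reduction": 800,
    "CloudWatch reduction": 33_380,
    "Graviton3 instances": 343_100,
}

total_savings = sum(SAVINGS.values())
optimized = BASELINE - total_savings  # additive model: no overlap between line items

for name, amount in SAVINGS.items():
    print(f"{name}: ${amount:,} ({100 * amount / BASELINE:.2f}%)")
print(f"Total savings: ${total_savings:,} ({100 * total_savings / BASELINE:.1f}%)")
print(f"Optimized: ${optimized:,}/month, ${optimized * 36 / 1e6:.1f}M over 3 years")
```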


Break-Even Analysis

Development Costs

Engineering Effort (to build hybrid system):

Senior Engineers: 4 engineers × 6 months × $200K/year = $400K
Staff Engineers: 2 engineers × 6 months × $300K/year = $300K
Principal Engineer: 1 engineer × 3 months × $400K/year = $100K

Total development: $800K

Infrastructure Costs (during development):

Dev/staging environments: $50K/month × 6 months = $300K

Total One-Time Cost: $1.1M


vs AWS Neptune

Monthly Savings (vs optimized hybrid): $4,313,672 - $445,652 = $3,868,020/month

Break-Even: $1.1M ÷ $3,868,020/month = 0.28 months (8 days)

Assessment: ✅ Break-even in 8 days vs Neptune


vs Neo4j Enterprise

Monthly Savings: $941,800 - $445,652 = $496,148/month

Break-Even: $1.1M ÷ $496,148/month = 2.2 months

Assessment: ✅ Break-even in 2.2 months vs Neo4j
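Break-even is the one-time cost divided by the monthly savings against the alternative. The sketch below rebuilds the $1.1M one-time cost from the staffing plan and then computes the Neo4j break-even; the function names and the `(headcount, months, salary)` tuple layout are illustrative assumptions.

```python
def development_cost(team, infra_monthly: float, infra_months: int) -> float:
    """One-time cost: salaries prorated by months worked, plus dev/staging infrastructure."""
    salaries = sum(count * months / 12 * salary for count, months, salary in team)
    return salaries + infra_monthly * infra_months


def break_even_months(one_time_cost: float, alternative_monthly: float,
                      hybrid_monthly: float) -> float:
    """Months until cumulative savings cover the one-time cost."""
    savings = alternative_monthly - hybrid_monthly
    if savings <= 0:
        raise ValueError("no monthly savings, so the investment never breaks even")
    return one_time_cost / savings


# (headcount, months on project, annual salary)
team = [(4, 6, 200_000), (2, 6, 300_000), (1, 3, 400_000)]
one_time = development_cost(team, infra_monthly=50_000, infra_months=6)
neo4j_months = break_even_months(one_time, alternative_monthly=941_800, hybrid_monthly=445_652)

print(f"One-time cost: ${one_time:,.0f}")
print(f"Break-even vs Neo4j: {neo4j_months:.1f} months")
```

Passing the AWS reserved cost ($899,916/month) as `hybrid_monthly` instead of the fully optimized figure shows how much of the break-even depends on the optimizations being realized.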


vs Building In-House Graph Database

Alternative: Build custom graph database from scratch

Estimated Effort: 2 years, 10 engineers

Engineering cost:
10 engineers × 2 years × $250K/year = $5M

Infrastructure (2 years development):
$100K/month × 24 months = $2.4M

Total: $7.4M

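vs Hybrid Architecture: $1.1M development over 6 months (the $10.8M/year operational cost is left out here because the $7.4M custom-build estimate covers development only)

Assessment: ⚠️ Custom graph DB costs about 6.7× more to develop ($7.4M vs $1.1M) and carries higher risk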


Risk Analysis

Cost Overrun Risks

Risk 1: Underestimated Data Growth

Scenario: Data grows 2× faster than expected (200B vertices in year 1)

Impact:

Double all costs: $445,652 × 2 = $891,304/month
Additional annual cost: $5.35M

Mitigation:

  • Monitor growth rate monthly
  • Implement data retention policies
  • Archive old data to Glacier

Likelihood: Medium Impact: High


Risk 2: Cross-AZ Traffic Higher Than Expected

Scenario: Cross-AZ traffic reaches 10% of queries instead of the 1% target

Impact:

Current: 5% cross-AZ = $1,000/month
Optimistic: 1% cross-AZ = $200/month (modeled)
Pessimistic: 10% cross-AZ = $2,000/month
Additional cost vs modeled: $1,800/month = $21.6K/year

Mitigation:

  • Aggressive placement hint strategy
  • Monitor cross-AZ traffic patterns
  • Optimize vertex placement algorithms

Likelihood: Low Impact: Low


Risk 3: Reserved Instance Lock-In

Scenario: Need to scale down due to lower demand

Impact:

Committed to 1000 instances for 3 years
If only 500 instances are needed, overpaying by about $223K/month (half of the $445,652/month optimized cost)

Mitigation:

  • Start with 70% reserved, 30% on-demand
  • Use convertible RIs (higher cost but flexible)
  • Resell unused RIs on marketplace

Likelihood: Low Impact: Medium


Recommendations

Primary Recommendation

Deploy hybrid architecture on AWS with 3-year reserved instances

Rationale:

  1. ✅ 79% cheaper than AWS Neptune ($32.4M vs $155.4M over 3 years)
  2. ✅ No vendor lock-in (portable to GCP/Azure)
  3. ✅ Proven performance (validated in MEMO-074)
  4. ✅ Robust DR strategy (validated in MEMO-075)
  5. ✅ Break-even in 8 days vs Neptune

3-Year TCO: $32.4M (AWS Reserved) or $16.0M (fully optimized)


Alternative Recommendation

Deploy on Azure for lowest cost

Rationale:

  1. ✅ 52% cheaper than AWS ($15.6M vs $32.4M over 3 years)
  2. ✅ Same hybrid architecture (portable)
  3. ⚠️ Less mature graph tooling ecosystem
  4. ⚠️ Team learning curve on Azure

3-Year TCO: $15.6M (Azure Reserved)


Optimization Roadmap

Phase 1: Quick Wins (Month 1):

  • ✅ Enable S3 Intelligent Tiering ($24K/year savings)
  • ✅ Reduce CloudWatch usage ($400K/year savings)
  • ✅ Purchase 3-year reserved instances ($10.3M/year savings)

Phase 2: Medium-Term (Months 2-6):

  • Implement placement hints ($10K/year savings)
  • Evaluate Graviton3 migration ($4.1M/year savings)
  • Use spot instances for dev/test ($900K/year savings)

Phase 3: Long-Term (Year 2+):

  • Multi-cloud strategy (AWS + GCP for redundancy)
  • Custom ARM-optimized Redis build
  • Advanced cost anomaly detection

Next Steps

Weeks 17-20: Infrastructure Requirements

Focus: Detailed infrastructure planning and deployment readiness

Tasks:

  1. Week 17: Network and compute infrastructure design
  2. Week 18: Observability stack setup (Prometheus, Grafana, Jaeger)
  3. Week 19: Development tooling and CI/CD pipelines
  4. Week 20: Infrastructure gaps and readiness assessment

Success Criteria:

  • Production deployment plan with timeline
  • All infrastructure requirements documented
  • Cost model validated with pilot deployment
  • Team training completed

Appendices

Appendix A: Pricing Sources

AWS Pricing (as of 2025-11-16):

GCP Pricing:

Azure Pricing:


Appendix B: Cost Calculator Spreadsheet

Interactive Cost Model: cost-model.xlsx

Inputs:

  • Number of vertices (100M - 1T)
  • Hot tier percentage (5% - 20%)
  • Access pattern (Zipf alpha)
  • Cloud provider (AWS, GCP, Azure)
  • Commitment (on-demand, 1-year, 3-year)

Outputs:

  • Monthly operational cost
  • 3-year TCO
  • Cost per vertex
  • Cost per query

Appendix C: Sensitivity Analysis

Data Growth Impact:

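| Vertices | Hot Tier (10%) | Cold Tier | Total/month | 3-Year TCO |
|---|---|---|---|---|
| 50B | $376,420 | $2,176 | $378,596 | $13.6M |
| 100B | $752,840 | $4,351 | $757,191 | $27.3M |
| 200B | $1,505,680 | $8,702 | $1,514,382 | $54.5M |
| 500B | $3,764,200 | $21,755 | $3,785,955 | $136.3M |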

Linear Scaling: Cost scales linearly with data volume
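The table rows follow from multiplying the 100B baseline by the vertex ratio. A minimal sketch of that linear model, covering only the hot and cold tiers as the table does (proxy, metadata, backup, and monitoring costs are excluded); `scaled_cost` is an illustrative name:

```python
BASE_VERTICES_B = 100  # baseline size in billions of vertices
BASE_HOT = 752_840     # reserved hot tier, USD/month at 100B vertices
BASE_COLD = 4_351      # S3 cold tier, USD/month at 100B vertices


def scaled_cost(vertices_b: float) -> dict:
    """Scale hot- and cold-tier cost linearly with vertex count (in billions)."""
    factor = vertices_b / BASE_VERTICES_B
    hot, cold = BASE_HOT * factor, BASE_COLD * factor
    monthly = hot + cold
    return {"hot": hot, "cold": cold, "monthly": monthly, "tco_3yr": monthly * 36}


for vertices in (50, 100, 200, 500):
    row = scaled_cost(vertices)
    print(f"{vertices}B: ${row['monthly']:,.0f}/month, ${row['tco_3yr'] / 1e6:.1f}M over 3 years")
```

A linear model ignores step changes such as volume discounts or per-cluster limits, so it is best read as a first-order estimate.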


Appendix D: ROI Calculation

Investment (Year 0): $1.1M (development)

Annual Savings vs Neptune:

  • Year 1: $51.8M - $10.8M = $41M saved
  • Year 2: $41M saved
  • Year 3: $41M saved
  • Total 3-year savings: $123M

ROI: ($123M - $1.1M) / $1.1M ≈ 111×, or roughly 11,100% over 3 years

Assessment: ✅ Exceptional return on investment
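The ROI figure is net gain divided by investment. A short sketch using the rounded annual costs from this appendix (the `roi_percent` helper is illustrative):

```python
def roi_percent(total_savings: float, investment: float) -> float:
    """Return on investment in percent: net gain divided by the amount invested."""
    return 100 * (total_savings - investment) / investment


annual_savings = 51.8e6 - 10.8e6  # Neptune annual cost minus hybrid (AWS reserved)
three_year_savings = 3 * annual_savings
roi = roi_percent(three_year_savings, investment=1.1e6)

print(f"3-year savings: ${three_year_savings / 1e6:.0f}M")
print(f"ROI: {roi:,.0f}%")
```

With an investment this small relative to the savings, the percentage is dominated by the Neptune baseline, so the result is only as reliable as that baseline.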


Appendix E: TCO vs Commercial Databases (Chart)

3-Year Total Cost of Ownership Comparison

[Bar chart: 3-year TCO by solution. Commercial native graph: AWS Neptune $155.4M, Neo4j $34M, TigerGraph $35M. Hybrid: AWS $32M, GCP $20M, Azure $16M.]

Insight: At 100B scale, the hybrid architecture (AWS Reserved) is 4.8× cheaper than AWS Neptune; Neo4j and TigerGraph land within about 6% of the hybrid cost