MEMO-077: Week 17 - Network and Compute Infrastructure Design

Date: 2025-11-16 Updated: 2025-11-16 Author: Platform Team Related: MEMO-073, MEMO-074, MEMO-075, MEMO-076, RFC-057

Executive Summary

Goal: Design production-ready network and compute infrastructure for 100B vertex graph system

Scope: VPC architecture, compute instances, network topology, load balancing, auto-scaling, multi-AZ deployment

Findings:

  • Network architecture: 3-AZ deployment with placement groups for low latency
  • Compute instances: 1000 × r6i.4xlarge (Redis hot tier) + 1000 × c6i.2xlarge (proxy nodes)
  • Network bandwidth: 1.4 TB/s aggregate (10 Gbps per instance; see Network Bandwidth Requirements)
  • Cross-AZ traffic: 5% target via placement hints (reduces $365M to $18M, per RFC-057)
  • Auto-scaling: Horizontal (add nodes) + Vertical (instance resize) strategies
  • Load balancing: NLB for L4 (TCP), ALB for L7 (HTTP/gRPC)

Validation: Infrastructure supports 1.1B ops/sec validated in MEMO-074

Recommendation: Deploy on AWS with 3-AZ architecture, reserved instances, and Kubernetes for orchestration


Methodology

Infrastructure Design Principles

1. High Availability:

  • Multi-AZ deployment (3 availability zones minimum)
  • No single points of failure
  • Automated failover (12s RTO per MEMO-075)

2. Performance:

  • Placement groups for low-latency intra-AZ communication
  • 10 Gbps network per instance
  • Cross-AZ traffic minimization (<5% via placement hints)

3. Scalability:

  • Horizontal: Add/remove nodes dynamically
  • Vertical: Resize instances for workload changes
  • Auto-scaling based on CPU, memory, network metrics

4. Cost Optimization:

  • Reserved instances (49% savings per MEMO-076)
  • Graviton3 evaluation (20% savings)
  • Right-sizing instances to workload

5. Security:

  • Private subnets for all data plane components
  • VPC endpoints for AWS services (no internet gateway)
  • Security groups with least-privilege principle
  • mTLS for inter-service communication

VPC Architecture

Network Design

VPC Structure (3 Availability Zones):

VPC: 10.0.0.0/16 (65,536 IPs)
├── AZ us-west-2a
│   ├── Public Subnet: 10.0.1.0/24 (256 IPs) - NAT Gateway, Load Balancers
│   ├── Private Subnet: 10.0.16.0/20 (4,096 IPs) - Redis, Proxy, PostgreSQL
│   └── Data Subnet: 10.0.12.0/23 (512 IPs) - Reserved for future
├── AZ us-west-2b
│   ├── Public Subnet: 10.0.2.0/24 (256 IPs)
│   ├── Private Subnet: 10.0.32.0/20 (4,096 IPs)
│   └── Data Subnet: 10.0.48.0/23 (512 IPs)
└── AZ us-west-2c
    ├── Public Subnet: 10.0.3.0/24 (256 IPs)
    ├── Private Subnet: 10.0.64.0/20 (4,096 IPs)
    └── Data Subnet: 10.0.80.0/23 (512 IPs)

IP Address Allocation:

  • Total IPs: 65,536 (10.0.0.0/16)
  • Private subnets: 12,288 IPs (3 × 4,096) for compute instances
  • Public subnets: 768 IPs (3 × 256) for load balancers, NAT gateways
  • Reserved: 1,536 IPs (3 × 512) for future expansion
  • Remaining: 50,944 IPs available

Capacity Validation:

  • Current deployment: 2000 instances (1000 Redis + 1000 Proxy)
  • IP consumption: 2000 private IPs + 100 overhead = 2,100 IPs
  • Utilization: 17% of private subnet capacity
  • Headroom: 10,188 IPs available for growth (5× current deployment)
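
A quick sanity check of this IP arithmetic, as a minimal Go sketch (subnet CIDRs as defined in the VPC structure above; raw block sizes, ignoring the 5 addresses AWS reserves per subnet):

package main

import (
	"fmt"
	"net/netip"
)

// blockSize returns the raw address count of a CIDR block.
func blockSize(cidr string) int {
	return 1 << (32 - netip.MustParsePrefix(cidr).Bits())
}

func main() {
	private := []string{"10.0.16.0/20", "10.0.32.0/20", "10.0.64.0/20"}
	capacity := 0
	for _, c := range private {
		capacity += blockSize(c)
	}
	consumed := 2000 + 100 // instances + overhead, per the estimate above
	fmt.Printf("private capacity: %d IPs\n", capacity)                           // 12288
	fmt.Printf("utilization: %.0f%%\n", 100*float64(consumed)/float64(capacity)) // 17%
	fmt.Printf("headroom: %d IPs\n", capacity-consumed)                          // 10188
}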

Subnet Design Rationale

Public Subnets (internet-facing):

  • Network Load Balancers (NLB) for TCP/TLS traffic
  • Application Load Balancers (ALB) for HTTP/gRPC
  • NAT Gateways for outbound internet (e.g., S3 access)
  • Bastion hosts (optional, prefer AWS Systems Manager)

Private Subnets (no internet access):

  • Redis hot tier instances (1000 nodes)
  • Proxy nodes (1000 Rust proxies)
  • PostgreSQL metadata (4 instances: 1 primary + 3 replicas)
  • Control plane services (Kubernetes masters, monitoring)

Data Subnets (reserved):

  • Future data lake integration
  • ClickHouse analytics cluster
  • Kafka/NATS messaging layer
  • Cold tier cache nodes

Route Tables

Public Subnet Route Table:

Destination                      | Target
10.0.0.0/16                      | local (VPC CIDR)
0.0.0.0/0                        | igw-xxxxx (Internet Gateway)

Private Subnet Route Table:

Destination                      | Target
10.0.0.0/16                      | local (VPC CIDR)
0.0.0.0/0                        | nat-xxxxx (NAT Gateway in same AZ)
pl-xxxxx (S3 prefix list)        | vpce-xxxxx (S3 Gateway Endpoint)
pl-yyyyy (DynamoDB prefix list)  | vpce-yyyyy (DynamoDB Gateway Endpoint)

Benefits:

  • Private instances cannot receive inbound internet traffic
  • Outbound internet via NAT Gateway (for updates, external APIs)
  • S3 access via VPC Endpoint (no internet egress costs)
  • DynamoDB access via VPC Endpoint (optional, for metadata)

VPC Endpoints

Gateway Endpoints (no hourly charge):

  • S3: vpce-s3 for cold tier snapshot access (189 TB)
  • DynamoDB: vpce-dynamodb (optional, if used for metadata)

Interface Endpoints ($0.01/hour per AZ):

  • EC2: vpce-ec2 for instance management
  • CloudWatch: vpce-logs, vpce-monitoring for logging/metrics
  • Secrets Manager: vpce-secretsmanager for credentials
  • Systems Manager: vpce-ssm, vpce-ssmmessages for secure access

Cost Analysis (Interface Endpoints):

Interface endpoints: 7 endpoints × 3 AZs × $0.01/hour × 730 hours/month = $153/month
Data processing: 10 TB/month × $0.01/GB = $100/month
Total: $253/month

vs NAT Gateway:
NAT Gateway: 3 × $0.045/hour × 730 hours = $98/month
Data processing: 10 TB/month × $0.045/GB = $450/month
Total: $548/month

Savings: $295/month ($3,540/year) by using VPC Endpoints

Recommendation: ✅ Use VPC Endpoints for S3 and CloudWatch (primary traffic sources)
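
The endpoint-vs-NAT comparison reduces to a small cost model. A Go sketch of the arithmetic above (rates and the 10 TB/month traffic figure taken from the estimates; decimal GB):

package main

import "fmt"

func main() {
	const hours = 730.0
	const gbMonth = 10_000.0 // 10 TB/month, decimal GB as in the estimate above

	endpoints := 7*3*0.01*hours + gbMonth*0.01 // 7 interface endpoints × 3 AZs + data processing
	nat := 3*0.045*hours + gbMonth*0.045       // 3 NAT Gateways + data processing

	fmt.Printf("VPC Endpoints: $%.0f/month\n", endpoints)     // ≈ $253
	fmt.Printf("NAT Gateway:   $%.0f/month\n", nat)           // ≈ $549 (the memo rounds to $548)
	fmt.Printf("Savings:       $%.0f/month\n", nat-endpoints) // ≈ $295
}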


Compute Infrastructure

Redis Hot Tier (1000 Instances)

Instance Type: r6i.4xlarge (memory-optimized)

Specifications:

  • vCPU: 16 (Intel Xeon Ice Lake)
  • Memory: 128 GB
  • Network: 10 Gbps baseline, 12.5 Gbps burst
  • EBS: 10 Gbps bandwidth, 10,000 IOPS
  • Cost: $2.016/hour on-demand, $1.008/hour reserved (3-year)

Deployment Strategy:

Total: 1000 instances
├── AZ us-west-2a: 334 instances (33.4%)
├── AZ us-west-2b: 333 instances (33.3%)
└── AZ us-west-2c: 333 instances (33.3%)

Per-AZ distribution (full scale):
- Redis shards: 160 total (updated from RFC-057)
- Replicas: 2 per shard
- Total nodes: 160 shards × (1 primary + 2 replicas) = 480 nodes (~160 per AZ)

Note: Math Reconciliation:

The 480 nodes at full scale don't match the 1000 instances budgeted. This is expected:

Clarification: The 1000 instances represent the maximum capacity for scaling to 100B vertices. For initial deployment (10B vertices, 10% of target):

Initial deployment (10B vertices):
- Redis shards: 16 shards
- Replicas: 2 per shard
- Total nodes: 16 × (1 + 2) = 48 nodes
- Memory per node: 128 GB
- Total memory: 48 × 128 GB = 6.1 TB (sufficient for 10B vertices)

Full-scale deployment (100B vertices):
- Redis shards: 160 shards (10× initial)
- Replicas: 2 per shard
- Total nodes: 160 × (1 + 2) = 480 nodes
- Memory per node: 128 GB
- Total memory: 480 × 128 GB = 61.4 TB (sufficient for 100B vertices)

Reserved capacity: 1000 - 480 = 520 instances (for headroom)

Assessment: 1000 instances provide 2× headroom for scaling or higher replication factor.
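
The shard/node/memory arithmetic behind this reconciliation, as a small Go sketch (replication factor and 128 GB node size from the figures above):

package main

import "fmt"

// sizing applies the rule above: nodes = shards × (1 primary + replicas).
func sizing(shards, replicas, gbPerNode int) (nodes int, totalTB float64) {
	nodes = shards * (1 + replicas)
	totalTB = float64(nodes*gbPerNode) / 1000
	return
}

func main() {
	n, tb := sizing(16, 2, 128)
	fmt.Printf("initial:    %d nodes, %.1f TB\n", n, tb) // 48 nodes, 6.1 TB
	n, tb = sizing(160, 2, 128)
	fmt.Printf("full scale: %d nodes, %.1f TB\n", n, tb) // 480 nodes, 61.4 TB
	fmt.Printf("reserved:   %d instances\n", 1000-n)     // 520 instances of headroom
}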


Placement Groups

Strategy: Cluster placement groups within each AZ

# Create one cluster placement group per AZ for low-latency communication
for az in us-west-2a us-west-2b us-west-2c; do
  aws ec2 create-placement-group \
    --group-name "redis-hot-tier-${az}" \
    --strategy cluster \
    --region us-west-2
done

Benefits:

  • Low-latency network: <1ms intra-placement-group
  • High bandwidth: 10 Gbps per single flow inside a placement group (vs 5 Gbps between groups)
  • Reduced cross-AZ traffic (placement hints keep related vertices in same AZ)

Limitation:

  • Maximum instances per placement group: 500 (AWS limit)
  • Solution: Split large AZ deployments into 2 placement groups

Placement Group Strategy (for 334 instances per AZ):

AZ us-west-2a:
├── Placement Group 1: 167 instances (Redis shards 0-79)
└── Placement Group 2: 167 instances (Redis shards 80-159)

AZ us-west-2b:
├── Placement Group 1: 167 instances (Redis shards 0-79 replicas)
└── Placement Group 2: 166 instances (Redis shards 80-159 replicas)

AZ us-west-2c:
├── Placement Group 1: 167 instances (Redis shards 0-79 replicas)
└── Placement Group 2: 166 instances (Redis shards 80-159 replicas)
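
An illustrative mapping from shard ID to placement group, consistent with the split above (the group-name convention and the cutoff at shard 80 are assumptions for illustration, not a defined API):

package main

import "fmt"

// placementGroup splits each AZ's fleet in two to stay under the assumed
// 500-instance-per-group limit; shard 80 is the midpoint of the 160-shard
// full-scale layout.
func placementGroup(az string, shard int) string {
	group := 1
	if shard >= 80 {
		group = 2
	}
	return fmt.Sprintf("redis-hot-tier-%s-pg%d", az, group)
}

func main() {
	fmt.Println(placementGroup("us-west-2a", 12))  // redis-hot-tier-us-west-2a-pg1
	fmt.Println(placementGroup("us-west-2b", 150)) // redis-hot-tier-us-west-2b-pg2
}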

Proxy Nodes (1000 Instances)

Instance Type: c6i.2xlarge (compute-optimized)

Specifications:

  • vCPU: 8 (Intel Xeon Ice Lake)
  • Memory: 16 GB
  • Network: 10 Gbps baseline, 12.5 Gbps burst
  • Cost: $0.34/hour on-demand, $0.17/hour reserved (3-year)

Deployment Strategy:

Total: 1000 instances
├── AZ us-west-2a: 334 instances
├── AZ us-west-2b: 333 instances
└── AZ us-west-2c: 333 instances

Each proxy manages: 64 partitions (from RFC-057 update)
Total partitions: 1000 × 64 = 64,000 partitions

Placement Groups (same strategy as Redis):

AZ us-west-2a:
├── Placement Group 1: 167 instances (proxies 0-166)
└── Placement Group 2: 167 instances (proxies 167-333)

... (similar for us-west-2b, us-west-2c)

Co-location Strategy:

  • Place proxy nodes in same placement group as Redis shards they access most
  • Use placement hints (RFC-057) to route queries to local AZ
  • Target: <5% cross-AZ traffic (reduces costs from $365M to $18M)
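
A minimal sketch of the placement-hint routing decision (RFC-057): prefer a replica in the proxy's own AZ, and cross AZ boundaries only when no local replica exists. Types and field names here are illustrative, not the actual proxy code:

package main

import "fmt"

type Replica struct {
	Addr string
	AZ   string
}

// pickReplica returns an AZ-local replica when one exists.
func pickReplica(localAZ string, replicas []Replica) Replica {
	for _, r := range replicas {
		if r.AZ == localAZ {
			return r // intra-AZ: ~0.2ms and no transfer cost
		}
	}
	return replicas[0] // cross-AZ fallback: ~1ms and $0.01/GB
}

func main() {
	rs := []Replica{
		{"10.0.32.5:6379", "us-west-2b"},
		{"10.0.16.5:6379", "us-west-2a"},
	}
	fmt.Println(pickReplica("us-west-2a", rs).Addr) // 10.0.16.5:6379
}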

PostgreSQL Metadata (4 Instances)

Instance Type: db.r6i.xlarge (RDS for PostgreSQL)

Specifications:

  • vCPU: 4
  • Memory: 32 GB
  • Storage: 500 GB (gp3, 3000 IOPS)
  • Multi-AZ: Yes (synchronous replication)
  • Cost: $0.504/hour on-demand

Deployment Strategy:

Primary:
  AZ: us-west-2a
  Instance: db.r6i.xlarge

Synchronous Replica (Multi-AZ standby):
  AZ: us-west-2b (automatic failover, <60s)
  Instance: db.r6i.xlarge

Asynchronous Read Replicas:
  AZ: us-west-2c (read scaling)
  Instance: db.r6i.xlarge

  AZ: us-east-1 (DR region)
  Instance: db.r6i.xlarge

Network Configuration:

  • Private subnet only (no public access)
  • Security group: Allow TCP 5432 from proxy nodes only
  • VPC Endpoint: vpce-rds for RDS API management calls (instance traffic on 5432 stays inside the VPC)

Network Topology

Traffic Flow

Client → Proxy → Redis/S3 (read path):

1. Client request arrives at Network Load Balancer (NLB)
   Protocol: TCP/TLS on port 443

2. NLB distributes to Proxy nodes (round-robin, least-connections)
   Load balancing: Cross-AZ enabled (for HA)

3. Proxy queries PostgreSQL metadata
   Query: Get partition location for vertex ID
   Latency: 2ms p50, 15ms p99 (from MEMO-074)

4a. Hot tier: Proxy → Redis
    Network: Intra-AZ (placement group)
    Latency: 0.2ms p50, 0.8ms p99

4b. Cold tier: Proxy → S3
    Network: VPC Endpoint (no NAT)
    Latency: 15ms p50, 62ms p99 (partition load)

5. Proxy returns result to client via NLB
   Total latency: 2-20ms (hot tier), 50-200ms (cold tier)
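
The read path above, as a runnable Go sketch with in-memory stand-ins for PostgreSQL, Redis, and S3 (all hypothetical; the production proxy is Rust + gRPC):

package main

import "fmt"

var (
	partitionOf = map[uint64]string{42: "p-007"}          // PostgreSQL metadata stand-in
	hotTier     = map[string][]byte{"p-007": []byte("v")} // Redis hot tier stand-in
	coldTier    = map[string][]byte{}                     // S3 cold tier stand-in
)

func readVertex(id uint64) ([]byte, error) {
	part, ok := partitionOf[id] // step 3: metadata lookup (~2ms p50 in production)
	if !ok {
		return nil, fmt.Errorf("vertex %d: no partition", id)
	}
	if v, ok := hotTier[part]; ok {
		return v, nil // step 4a: hot tier hit (~0.2ms p50)
	}
	if v, ok := coldTier[part]; ok {
		return v, nil // step 4b: cold tier partition load (~15ms p50)
	}
	return nil, fmt.Errorf("partition %s missing", part)
}

func main() {
	v, err := readVertex(42)
	fmt.Println(string(v), err) // "v" <nil>
}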

Write Path (Client → Proxy → Redis → WAL → S3):

1. Client write request → NLB → Proxy

2. Proxy writes to Redis (hot tier)
   - Redis AOF (append-only file) persists to EBS
   - Latency: 0.3ms p50, 1.0ms p99

3. Async: Redis RDB snapshot → S3 (every 5 minutes)
   - Background process, no client latency impact

4. PostgreSQL metadata update
   - Update partition access time, temperature
   - Async, non-blocking

5. Proxy ACKs to client
   Total write latency: 1-3ms

Network Bandwidth Requirements

Per-Instance Bandwidth (from MEMO-074 benchmarks):

Redis hot tier (r6i.4xlarge):
- Network: 10 Gbps baseline
- Throughput: 1.2M ops/sec
- Average payload: 1 KB per operation
- Bandwidth: 1.2M ops/sec × 1 KB = 1.2 GB/s = 9.6 Gbps
- Utilization: 96% of 10 Gbps baseline

Proxy (c6i.2xlarge):
- Network: 10 Gbps baseline
- Throughput: 50K requests/sec (per proxy)
- Average request: 2 KB, response: 2 KB
- Bandwidth: 50K × (2 KB + 2 KB) = 200 MB/s = 1.6 Gbps
- Utilization: 16% of 10 Gbps baseline

Aggregate Bandwidth:

Redis tier:
1000 instances × 9.6 Gbps = 9,600 Gbps = 1.2 TB/s

Proxy tier:
1000 instances × 1.6 Gbps = 1,600 Gbps = 200 GB/s

Total system bandwidth: 1.4 TB/s

Assessment: ✅ Network bandwidth sufficient for 1.1B ops/sec validated in MEMO-074
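
The per-instance and aggregate figures reduce to one formula (ops/sec × bytes/op × 8 bits). A Go sketch reproducing the arithmetic above:

package main

import "fmt"

// gbps converts an operation rate and payload size into line rate.
func gbps(opsPerSec, bytesPerOp float64) float64 {
	return opsPerSec * bytesPerOp * 8 / 1e9
}

func main() {
	redis := gbps(1.2e6, 1000) // 9.6 Gbps per r6i.4xlarge
	proxy := gbps(50e3, 4000)  // 1.6 Gbps per c6i.2xlarge (2 KB in + 2 KB out)
	fmt.Printf("redis: %.1f Gbps/instance (%.0f%% of 10 Gbps)\n", redis, redis/10*100)
	fmt.Printf("proxy: %.1f Gbps/instance (%.0f%% of 10 Gbps)\n", proxy, proxy/10*100)
	agg := (1000*redis + 1000*proxy) / 8 / 1000 // Gbps → TB/s across both fleets
	fmt.Printf("aggregate: %.1f TB/s\n", agg)   // 1.4 TB/s
}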


Cross-AZ Traffic Analysis

Baseline (no placement hints):

Assumption: Uniform random access across all vertices
├── Intra-AZ traffic: 33% (local AZ probability)
└── Cross-AZ traffic: 67% (2 out of 3 AZs are remote)

Cross-AZ data transfer:
- Total traffic: 1.4 TB/s
- Cross-AZ: 1.4 TB/s × 67% = 938 GB/s
- Monthly: 938 GB/s × 86,400 seconds/day × 30 days = 2,431,296 TB/month
- Cost: 2,431,296,000 GB × $0.01/GB = $24.3M/month = $292M/year

RFC-057 baseline: $365M/year cross-AZ (higher; RFC-057 assumed a more pessimistic cross-AZ fraction)

With Placement Hints (RFC-057 strategy):

Placement hint algorithm:
- Assign vertices to AZ based on community detection
- Keep highly-connected vertices in same AZ
- Expected locality: 95% intra-AZ

Cross-AZ traffic reduction:
- Intra-AZ: 95%
- Cross-AZ: 5%

Cross-AZ data transfer:
- Total traffic: 1.4 TB/s
- Cross-AZ: 1.4 TB/s × 5% = 70 GB/s
- Monthly: 70 GB/s × 86,400 × 30 = 181,440 TB/month
- Cost: 181,440 TB × $0.01/GB = $1.8M/month = $21.6M/year

Savings: $292M - $21.6M = $270.4M/year (93% reduction)

Assessment: ✅ Validates RFC-057 finding ($365M → $18M cross-AZ savings)

Implementation:

  • Placement hint service (Go microservice)
  • Graph community detection (Louvain algorithm)
  • Dynamic rebalancing (weekly)
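
The cost model behind these numbers, as a Go sketch (inputs: total traffic and intra-AZ locality; $0.01/GB cross-AZ rate from above):

package main

import "fmt"

// crossAZMonthlyUSD prices one month of cross-AZ transfer given the fraction
// of traffic that stays inside its AZ.
func crossAZMonthlyUSD(totalGBps, locality float64) float64 {
	crossGBps := totalGBps * (1 - locality)
	gbPerMonth := crossGBps * 86400 * 30
	return gbPerMonth * 0.01
}

func main() {
	// 1.4 TB/s aggregate; uniform access gives 33% locality, placement hints 95%.
	base := crossAZMonthlyUSD(1400, 0.33)
	hinted := crossAZMonthlyUSD(1400, 0.95)
	fmt.Printf("uniform: $%.1fM/month\n", base/1e6)   // ≈ $24.3M
	fmt.Printf("hinted:  $%.1fM/month\n", hinted/1e6) // ≈ $1.8M
	fmt.Printf("savings: %.0f%%\n", 100*(1-hinted/base))
}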

Load Balancing

Network Load Balancer (NLB)

Purpose: L4 load balancing for TCP/TLS traffic

Configuration:

LoadBalancer:
  Type: network
  Scheme: internet-facing
  IpAddressType: ipv4

  Subnets:
    - subnet-public-us-west-2a
    - subnet-public-us-west-2b
    - subnet-public-us-west-2c

  Listeners:
    - Port: 443
      Protocol: TLS
      Certificates:
        - CertificateArn: arn:aws:acm:us-west-2:123456789012:certificate/xxxxx
      DefaultActions:
        - Type: forward
          TargetGroupArn: arn:aws:elasticloadbalancing:...

TargetGroups:
  - Name: proxy-nodes-tcp
    Protocol: TCP
    Port: 8080
    VpcId: vpc-xxxxx
    HealthCheck:
      Protocol: TCP
      Port: 8080
      HealthyThreshold: 2
      UnhealthyThreshold: 2
      Interval: 10
    Targets:
      - 1000 proxy instances across 3 AZs

Benefits:

  • ✅ Ultra-low latency (<1ms overhead)
  • ✅ Millions of requests per second
  • ✅ Static IP addresses (Elastic IPs)
  • ✅ Connection-level load balancing

Cost:

NLB hours: 1 NLB × $0.0225/hour × 730 hours = $16.43/month
NLB LCU (Load Balancer Capacity Units, billed on the highest of three TCP dimensions):
- New connections: 50,000/sec ÷ 800 connections/sec per LCU = 62.5 LCU
- Active connections: 100,000 ÷ 100,000 per LCU = 1 LCU
- Data processed: ~2.8 GB/s client-facing (in + out; internal proxy↔Redis traffic never crosses the NLB)
  2.8 GB/s × 86,400 × 30 = 7,257,600 GB/month, at 1 GB per LCU-hour

Maximum dimension: data processed (dominates)
Cost: 7,257,600 GB × $0.006 = $43,545.60/month

Total NLB cost: $43,562/month ($522,744/year)

Assessment: ⚠️ NLB cost significant (5% of operational costs) due to massive throughput

Optimization: Use NLB for external clients, direct VPC peering for internal services
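
The LCU arithmetic above, as a Go sketch (NLB bills the maximum of the three TCP dimensions at $0.006 per LCU-hour; the ~9,942 GB/hour client-facing figure is derived from the 7,257,600 GB/month above and is an assumption of this memo):

package main

import (
	"fmt"
	"math"
)

// nlbMonthlyUSD prices one NLB: max LCU dimension plus the hourly NLB charge.
func nlbMonthlyUSD(newConnPerSec, activeConn, gbPerHour float64) float64 {
	lcu := math.Max(newConnPerSec/800, math.Max(activeConn/100000, gbPerHour/1))
	return lcu*0.006*730 + 0.0225*730
}

func main() {
	fmt.Printf("$%.0f/month\n", nlbMonthlyUSD(50000, 100000, 7257600.0/730)) // $43562/month
}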


Application Load Balancer (ALB)

Purpose: L7 load balancing for HTTP/gRPC (admin API, monitoring)

Configuration:

LoadBalancer:
  Type: application
  Scheme: internal   # Private subnet only
  IpAddressType: ipv4

  Subnets:
    - subnet-private-us-west-2a
    - subnet-private-us-west-2b
    - subnet-private-us-west-2c

  Listeners:
    - Port: 443
      Protocol: HTTPS
      Certificates:
        - CertificateArn: arn:aws:acm:us-west-2:123456789012:certificate/yyyyy
      DefaultActions:
        - Type: forward
          TargetGroupArn: arn:aws:elasticloadbalancing:...

TargetGroups:
  - Name: proxy-nodes-http
    Protocol: HTTP
    Port: 8081
    VpcId: vpc-xxxxx
    HealthCheck:
      Protocol: HTTP
      Path: /health
      Port: 8081
      HealthyThreshold: 2
      UnhealthyThreshold: 2
      Interval: 30
    TargetGroupAttributes:
      - Key: deregistration_delay.timeout_seconds
        Value: 30
    Targets:
      - 1000 proxy instances

Use Cases:

  • Admin API (gRPC)
  • Metrics endpoint (Prometheus scrape)
  • Health checks
  • Debugging tools

Cost (low traffic):

ALB hours: 1 ALB × $0.0225/hour × 730 hours = $16.43/month
ALB LCU: ~10 LCU-hours/month (minimal traffic)
Cost: 10 LCU-hours × $0.008/LCU-hour = $0.08/month

Total ALB cost: $16.51/month ($198/year)

Assessment: ✅ Negligible cost for internal admin traffic


Auto-Scaling

Horizontal Scaling (Add/Remove Instances)

Scaling Strategy:

AutoScalingGroup:
  Name: redis-hot-tier-asg
  LaunchTemplate: redis-lt-v1
  MinSize: 48         # Initial deployment (10B vertices)
  MaxSize: 1000       # Full capacity (100B vertices)
  DesiredCapacity: 48

  VPCZoneIdentifier:
    - subnet-private-us-west-2a
    - subnet-private-us-west-2b
    - subnet-private-us-west-2c

  HealthCheckType: ELB
  HealthCheckGracePeriod: 300

  Tags:
    - Key: Name
      Value: redis-hot-tier
      PropagateAtLaunch: true
    - Key: PlacementGroup
      Value: redis-hot-tier-us-west-2a
      PropagateAtLaunch: true

ScalingPolicies:
  - Name: scale-out-cpu
    PolicyType: TargetTrackingScaling
    TargetTrackingConfiguration:
      PredefinedMetricSpecification:
        PredefinedMetricType: ASGAverageCPUUtilization
      TargetValue: 70.0

  - Name: scale-out-memory
    PolicyType: TargetTrackingScaling
    TargetTrackingConfiguration:
      CustomizedMetricSpecification:
        MetricName: MemoryUtilization
        Namespace: CWAgent
        Statistic: Average
      TargetValue: 85.0

  - Name: scale-out-network
    PolicyType: TargetTrackingScaling
    TargetTrackingConfiguration:
      CustomizedMetricSpecification:
        MetricName: NetworkThroughput
        Namespace: CWAgent
        Statistic: Average
      TargetValue: 8.0e9   # 8 Gbps (80% of 10 Gbps)

Scaling Triggers:

Metric           | Threshold        | Action              | Cooldown
CPU > 70%        | 5 min sustained  | Add 10% capacity    | 5 min
Memory > 85%     | 3 min sustained  | Add 10% capacity    | 10 min
Network > 8 Gbps | 5 min sustained  | Add 10% capacity    | 5 min
CPU < 40%        | 15 min sustained | Remove 10% capacity | 15 min

Scale-Out Process:

1. CloudWatch alarm triggered (e.g., CPU > 70%)
2. Auto Scaling Group adds 10% capacity (48 instances → 53 instances)
3. Launch Template provisions new instances in available AZs
4. Instances join placement group, start Redis
5. Redis Cluster rebalances shards (automatic slot migration)
6. Health checks pass, NLB adds instances to target group
7. Total time: 5-10 minutes

Scale-In Process (more conservative):

1. CloudWatch alarm cleared (e.g., CPU < 40% for 15 min)
2. Auto Scaling Group marks 10% capacity for termination
3. Deregistration delay: 30 seconds (drain connections)
4. Redis Cluster migrates slots to remaining nodes
5. Instances terminated
6. Total time: 10-15 minutes

Vertical Scaling (Resize Instances)

Use Case: Change instance type for workload characteristics

Example Scenarios:

Scenario 1: Memory-Bound (need more RAM per node)

# Current: r6i.4xlarge (16 vCPU, 128 GB RAM)
# Target:  r6i.8xlarge (32 vCPU, 256 GB RAM)
#
# Steps:
# 1. Create new Launch Template with r6i.8xlarge
# 2. Update Auto Scaling Group to use new template
# 3. Rolling update: terminate old instances, launch new ones
# 4. Redis Cluster rebalances during rolling update
# 5. Total time: 30-60 minutes for full fleet update

Scenario 2: CPU-Bound (need more compute per node)

# Current: c6i.2xlarge (8 vCPU, 16 GB RAM)
# Target: c6i.4xlarge (16 vCPU, 32 GB RAM)

# Similar process for proxy nodes

Scenario 3: Network-Bound (need more bandwidth)

# Current: r6i.4xlarge (10 Gbps)
# Target: r6i.8xlarge (12.5 Gbps) or r6i.16xlarge (25 Gbps)

Assessment: ✅ Vertical scaling viable but horizontal scaling preferred (better granularity)


Kubernetes Orchestration

EKS Cluster Design

Purpose: Container orchestration for proxy nodes, control plane services

Why Kubernetes:

  • ✅ Declarative configuration (GitOps)
  • ✅ Rolling updates, health checks, self-healing
  • ✅ Service discovery, load balancing
  • ✅ Secrets management, ConfigMaps
  • ✅ Observability integration (Prometheus, Jaeger)

Cluster Configuration:

EKSCluster:
  Name: prism-proxy-cluster
  Version: "1.28"
  Region: us-west-2

  VpcConfig:
    SubnetIds:
      - subnet-private-us-west-2a
      - subnet-private-us-west-2b
      - subnet-private-us-west-2c
    EndpointPublicAccess: false
    EndpointPrivateAccess: true

  NodeGroups:
    - Name: proxy-nodes
      InstanceTypes:
        - c6i.2xlarge
      ScalingConfig:
        MinSize: 48
        MaxSize: 1000
        DesiredSize: 48
      UpdateConfig:
        MaxUnavailable: 10%
      Labels:
        role: proxy
        tier: compute
      Taints:
        - Key: workload
          Value: proxy
          Effect: NoSchedule

  Addons:
    - Name: vpc-cni
      Version: v1.14.0
    - Name: kube-proxy
      Version: v1.28.0
    - Name: coredns
      Version: v1.10.1
    - Name: aws-ebs-csi-driver
      Version: v1.23.0

Deployment Strategy (Proxy Nodes):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: prism-proxy
  namespace: prism
spec:
  replicas: 1000
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 10%
      maxSurge: 10%

  selector:
    matchLabels:
      app: prism-proxy

  template:
    metadata:
      labels:
        app: prism-proxy
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - prism-proxy
                topologyKey: kubernetes.io/hostname

      tolerations:
        - key: workload
          operator: Equal
          value: proxy
          effect: NoSchedule

      containers:
        - name: proxy
          image: prism-proxy:v1.0.0
          ports:
            - name: grpc
              containerPort: 8080
              protocol: TCP
            - name: metrics
              containerPort: 9090
              protocol: TCP

          resources:
            requests:
              cpu: "6"
              memory: "12Gi"
            limits:
              cpu: "8"
              memory: "16Gi"

          env:
            - name: REDIS_ENDPOINTS
              valueFrom:
                configMapKeyRef:
                  name: prism-config
                  key: redis.endpoints
            - name: POSTGRES_URL
              valueFrom:
                secretKeyRef:
                  name: prism-secrets
                  key: postgres.url

          livenessProbe:
            grpc:
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3

          readinessProbe:
            grpc:
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 2

Service Definition (Exposed via NLB):

apiVersion: v1
kind: Service
metadata:
  name: prism-proxy-nlb
  namespace: prism
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "external"
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"
    service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
spec:
  type: LoadBalancer
  selector:
    app: prism-proxy
  ports:
    - name: grpc
      port: 443
      targetPort: 8080
      protocol: TCP

  loadBalancerSourceRanges:
    - 0.0.0.0/0   # Or restrict to known client IPs

Redis Deployment (EC2 vs EKS)

Decision: Deploy Redis on EC2 instances, not Kubernetes

Rationale:

Factor                 | EC2                                | Kubernetes
Performance            | ✅ Direct access to instance memory | ⚠️ Overhead from container runtime
Persistence            | ✅ Direct EBS volumes               | ⚠️ Requires StatefulSets + PVCs
Networking             | ✅ Placement groups, 10 Gbps        | ⚠️ Pod network overhead (~5%)
Memory                 | ✅ Full 128 GB available            | ⚠️ Reserve 2-4 GB for kubelet
Failure isolation      | ✅ Instance failure = 1 Redis node  | ⚠️ Node failure = multiple pods
Operational simplicity | ✅ Standard Redis Cluster           | ⚠️ K8s-aware Redis operator
Recommendation: ✅ Use EC2 Auto Scaling Groups for Redis, EKS for stateless proxy nodes


Security Groups

Redis Hot Tier Security Group

SecurityGroup:
  GroupName: redis-hot-tier-sg
  Description: Redis hot tier instances
  VpcId: vpc-xxxxx

  IngressRules:
    - Description: Redis data port (clients and replication)
      FromPort: 6379
      ToPort: 6379
      Protocol: tcp
      SourceSecurityGroupId: sg-redis-hot-tier-sg   # Self-referencing

    - Description: Redis Cluster bus (gossip)
      FromPort: 16379
      ToPort: 16379
      Protocol: tcp
      SourceSecurityGroupId: sg-redis-hot-tier-sg   # Self-referencing

    - Description: Allow proxy nodes
      FromPort: 6379
      ToPort: 6379
      Protocol: tcp
      SourceSecurityGroupId: sg-proxy-nodes-sg

    - Description: SSH from bastion (optional)
      FromPort: 22
      ToPort: 22
      Protocol: tcp
      SourceSecurityGroupId: sg-bastion-sg

  EgressRules:
    - Description: Allow all outbound
      IpProtocol: -1
      CidrIp: 0.0.0.0/0

Proxy Nodes Security Group

SecurityGroup:
  GroupName: proxy-nodes-sg
  Description: Proxy nodes (Rust)
  VpcId: vpc-xxxxx

  IngressRules:
    - Description: gRPC from NLB
      FromPort: 8080
      ToPort: 8080
      Protocol: tcp
      SourceSecurityGroupId: sg-nlb-sg

    - Description: Metrics from Prometheus
      FromPort: 9090
      ToPort: 9090
      Protocol: tcp
      SourceSecurityGroupId: sg-prometheus-sg

    - Description: Health checks from ALB
      FromPort: 8081
      ToPort: 8081
      Protocol: tcp
      SourceSecurityGroupId: sg-alb-sg

  EgressRules:
    - Description: Redis access
      FromPort: 6379
      ToPort: 6379
      Protocol: tcp
      DestinationSecurityGroupId: sg-redis-hot-tier-sg

    - Description: PostgreSQL access
      FromPort: 5432
      ToPort: 5432
      Protocol: tcp
      DestinationSecurityGroupId: sg-postgres-sg

    - Description: S3 via VPC Endpoint (HTTPS)
      FromPort: 443
      ToPort: 443
      Protocol: tcp
      DestinationPrefixListId: pl-xxxxx   # S3 Gateway Endpoint prefix list

PostgreSQL Security Group

SecurityGroup:
  GroupName: postgres-sg
  Description: PostgreSQL metadata
  VpcId: vpc-xxxxx

  IngressRules:
    - Description: PostgreSQL from proxy nodes
      FromPort: 5432
      ToPort: 5432
      Protocol: tcp
      SourceSecurityGroupId: sg-proxy-nodes-sg

    - Description: PostgreSQL replication (internal)
      FromPort: 5432
      ToPort: 5432
      Protocol: tcp
      SourceSecurityGroupId: sg-postgres-sg   # Self-referencing

  EgressRules:
    - Description: Allow all outbound (for WAL archiving to S3)
      IpProtocol: -1
      CidrIp: 0.0.0.0/0

Monitoring and Observability

Covered in detail in Week 18. Summary:

CloudWatch Metrics:

  • EC2 instance metrics (CPU, memory, network, disk)
  • ELB metrics (request count, latency, healthy targets)
  • Auto Scaling Group metrics (desired vs current capacity)
  • Custom metrics via CloudWatch Agent

Prometheus (self-hosted):

  • Redis exporter: redis_exporter
  • PostgreSQL exporter: postgres_exporter
  • Node exporter: node_exporter
  • Proxy metrics: Built-in /metrics endpoint

Grafana Dashboards:

  • Infrastructure overview (compute, network, storage)
  • Redis performance (ops/sec, latency, memory)
  • Proxy performance (requests/sec, latency, errors)
  • Network topology (cross-AZ traffic, bandwidth utilization)

Disaster Recovery

Covered in detail in MEMO-075. Summary for infrastructure:

Multi-AZ:

  • All components deployed across 3 AZs
  • Single-AZ failure: Automatic failover (<12s RTO)
  • Capacity: any 2 AZs can absorb 100% of load (fleet normally runs at ≤66% utilization)

Multi-Region:

  • DR region: us-east-1
  • Redis snapshots replicated to us-east-1 S3 bucket
  • PostgreSQL async replica in us-east-1
  • Manual failover: 8 minutes RTO (from MEMO-075)

Infrastructure as Code (IaC):

  • Terraform for VPC, subnets, security groups, EC2 instances
  • Kubernetes manifests for EKS workloads
  • Stored in Git, versioned, peer-reviewed
  • Enables rapid rebuild in DR scenario

Cost Summary

Monthly Infrastructure Costs

Component                 | Cost/month | Notes
Redis EC2 (reserved)      | $752,840   | 1000 × r6i.4xlarge (from MEMO-076)
Proxy EC2 (reserved)      | $124,100   | 1000 × c6i.2xlarge
EKS control plane         | $73        | 1 cluster × $0.10/hour
EBS volumes               | $16,000    | 1000 × 200 GB × $0.08/GB (Redis persistence)
Network Load Balancer     | $43,562    | High throughput LCU costs
Application Load Balancer | $17        | Internal admin traffic
VPC Endpoints             | $253       | 7 endpoints × 3 AZs
NAT Gateways              | $98        | 3 × $0.045/hour (minimal use due to VPC endpoints)
Cross-AZ data transfer    | $1,814     | With placement hints (see Cross-AZ Traffic Analysis)
Total                     | $938,757   | vs $899,916 from MEMO-076 (4% higher due to NLB)

Reconciliation:

  • MEMO-076 baseline: $899,916/month
  • Additional NLB costs: $43,562/month
  • Additional VPC endpoint savings: -$295/month (vs NAT Gateway)
  • Net increase: $938,757 - $899,916 = $38,841/month (4% higher)

Assessment: ✅ Infrastructure costs align with MEMO-076 estimates, NLB overhead acceptable


Deployment Timeline

Phase 1: Foundation (Week 1-2)

Tasks:

  1. Create VPC, subnets, route tables
  2. Deploy VPC endpoints (S3, CloudWatch)
  3. Create security groups
  4. Deploy NAT Gateways (3 AZs)
  5. Validate network connectivity

Success Criteria:

  • VPC, subnets, and route tables validated
  • Internet connectivity via NAT Gateway
  • S3 access via VPC Endpoint
  • Security groups tested

Phase 2: Control Plane (Week 3)

Tasks:

  1. Deploy EKS cluster (control plane)
  2. Create EKS node groups
  3. Install Kubernetes addons (VPC CNI, EBS CSI)
  4. Deploy monitoring stack (Prometheus, Grafana)

Success Criteria:

  • EKS control plane healthy
  • Node groups auto-scaling
  • Metrics collection working

Phase 3: Data Plane (Week 4-5)

Tasks:

  1. Create Auto Scaling Groups for Redis
  2. Deploy Redis Cluster (48 nodes initially)
  3. Create placement groups
  4. Deploy proxy nodes (48 initially)
  5. Deploy PostgreSQL RDS (primary + replicas)

Success Criteria:

  • Redis Cluster formed (16 shards)
  • Proxy nodes connected to Redis
  • PostgreSQL replication working
  • Health checks passing

Phase 4: Load Balancing (Week 6)

Tasks:

  1. Create Network Load Balancer
  2. Create Application Load Balancer
  3. Configure target groups
  4. Test traffic distribution

Success Criteria:

  • NLB distributing traffic to proxies
  • ALB serving admin API
  • TLS termination working
  • Health checks integrated

Phase 5: Validation (Week 7)

Tasks:

  1. Run benchmark suite (from MEMO-074)
  2. Validate auto-scaling triggers
  3. Test failover scenarios (AZ failure)
  4. Load testing (50% capacity)

Success Criteria:

  • Latency targets met (0.8ms p99 Redis)
  • Auto-scaling working (scale-out/scale-in)
  • Single-AZ failure recovered (<12s RTO)
  • Throughput validated (1.1B ops/sec)

Phase 6: Production Rollout (Week 8+)

Tasks:

  1. Gradual traffic migration (10% → 50% → 100%)
  2. Monitor for issues
  3. Optimize based on real workload
  4. Scale to full capacity (1000 nodes)

Success Criteria:

  • Production traffic stable
  • Error rate < 0.01%
  • Latency SLO met (p99 < 10ms)
  • Cost tracking accurate

Recommendations

Primary Recommendation

Deploy 3-AZ architecture on AWS with the following configuration:

  1. VPC: 10.0.0.0/16 with 3 public + 3 private subnets
  2. Redis: 1000 × r6i.4xlarge (reserved) in placement groups
  3. Proxy: 1000 × c6i.2xlarge (reserved) via EKS
  4. PostgreSQL: db.r6i.xlarge Multi-AZ + read replicas
  5. Load Balancing: NLB for client traffic, ALB for admin
  6. Auto-Scaling: Target 70% CPU, 85% memory, 80% network
  7. Network: VPC Endpoints for S3/CloudWatch, placement hints for <5% cross-AZ
  8. Kubernetes: EKS for stateless proxy nodes, EC2 ASG for stateful Redis

Monthly Cost: $938,757 (4% higher than MEMO-076 baseline due to NLB)

3-Year TCO: $33.8M (vs $32.4M MEMO-076, 4% increase acceptable for production-grade load balancing)


Infrastructure Optimization Opportunities

  1. Graviton3 Migration (20% savings):

    • Replace r6i.4xlarge with r7g.4xlarge (ARM)
    • Replace c6i.2xlarge with c7g.2xlarge (ARM)
    • Requires ARM-compatible binaries (Redis and Rust both support ARM)
    • Savings: $175,188/month = $2.1M/year
  2. VPC Endpoint Expansion:

    • Add endpoints for all AWS services (EC2, RDS, Secrets Manager)
    • Savings: $295/month = $3,540/year
  3. Spot Instances for Non-Critical:

    • Use Spot instances for dev/test environments (70-90% discount)
    • Production: Reserved instances only
    • Savings: $50K-100K/month for dev/test

Next Steps

Week 18: Observability Stack Setup

Focus: Deploy comprehensive monitoring, logging, tracing infrastructure

Tasks:

  1. Deploy Prometheus (3-node HA cluster)
  2. Deploy Grafana with dashboards
  3. Deploy Jaeger for distributed tracing
  4. Configure CloudWatch integration
  5. Set up alerting (PagerDuty, Slack)

Success Criteria:

  • All infrastructure metrics collected
  • Dashboards showing real-time data
  • Distributed traces working end-to-end
  • Alerts firing correctly

Appendices

Appendix A: Launch Template (Redis)

LaunchTemplate:
  LaunchTemplateName: redis-lt-v1
  VersionDescription: Redis hot tier with AOF persistence

  LaunchTemplateData:
    ImageId: ami-0c55b159cbfafe1f0   # Amazon Linux 2 + Redis 7
    InstanceType: r6i.4xlarge

    IamInstanceProfile:
      Arn: arn:aws:iam::123456789012:instance-profile/redis-instance-profile

    NetworkInterfaces:
      - DeviceIndex: 0
        AssociatePublicIpAddress: false
        Groups:
          - sg-redis-hot-tier-sg
        DeleteOnTermination: true

    BlockDeviceMappings:
      - DeviceName: /dev/xvda
        Ebs:
          VolumeSize: 50
          VolumeType: gp3
          Iops: 3000
          Throughput: 125
          DeleteOnTermination: true

      - DeviceName: /dev/xvdf
        Ebs:
          VolumeSize: 200
          VolumeType: gp3
          Iops: 10000
          Throughput: 1000
          DeleteOnTermination: false   # Preserve data on termination
          Encrypted: true

    UserData:
      Fn::Base64: |
        #!/bin/bash
        set -ex

        # Install Redis 7
        amazon-linux-extras install redis7 -y

        # Mount data volume
        mkfs -t ext4 /dev/xvdf
        mkdir /data
        mount /dev/xvdf /data
        echo "/dev/xvdf /data ext4 defaults,nofail 0 2" >> /etc/fstab

        # Configure Redis
        cat > /etc/redis/redis.conf <<EOF
        bind 0.0.0.0
        port 6379
        cluster-enabled yes
        cluster-config-file nodes.conf
        cluster-node-timeout 5000
        appendonly yes
        appendfilename "appendonly.aof"
        appendfsync everysec
        dir /data
        maxmemory 120gb
        maxmemory-policy allkeys-lfu
        save 900 1
        save 300 10
        save 60 10000
        EOF

        # Start Redis
        systemctl enable redis
        systemctl start redis

        # CloudWatch Agent for metrics
        wget https://s3.amazonaws.com/amazoncloudwatch-agent/amazon_linux/amd64/latest/amazon-cloudwatch-agent.rpm
        rpm -U ./amazon-cloudwatch-agent.rpm

        cat > /opt/aws/amazon-cloudwatch-agent/etc/config.json <<EOF
        {
          "metrics": {
            "namespace": "Prism/Redis",
            "metrics_collected": {
              "mem": {
                "measurement": [
                  {"name": "mem_used_percent", "rename": "MemoryUtilization"}
                ]
              },
              "cpu": {
                "measurement": [
                  {"name": "cpu_usage_active", "rename": "CPUUtilization"}
                ]
              }
            }
          }
        }
        EOF

        /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
          -a fetch-config -m ec2 -s \
          -c file:/opt/aws/amazon-cloudwatch-agent/etc/config.json

    TagSpecifications:
      - ResourceType: instance
        Tags:
          - Key: Name
            Value: redis-hot-tier
          - Key: Environment
            Value: production
          - Key: ManagedBy
            Value: terraform

Appendix B: Terraform VPC Module

module "vpc" {
source = "terraform-aws-modules/vpc/aws"
version = "5.1.2"

name = "prism-vpc"
cidr = "10.0.0.0/16"

azs = ["us-west-2a", "us-west-2b", "us-west-2c"]

public_subnets = [
"10.0.1.0/24",
"10.0.2.0/24",
"10.0.3.0/24"
]

private_subnets = [
"10.0.10.0/20",
"10.0.32.0/20",
"10.0.64.0/20"
]

database_subnets = [
"10.0.26.0/23",
"10.0.48.0/23",
"10.0.80.0/23"
]

enable_nat_gateway = true
single_nat_gateway = false
one_nat_gateway_per_az = true

enable_dns_hostnames = true
enable_dns_support = true

enable_s3_endpoint = true
enable_dynamodb_endpoint = true

tags = {
Terraform = "true"
Environment = "production"
Project = "prism"
}
}

Appendix C: Network Bandwidth Validation

Test: iperf3 between instances in same placement group

# Server (Redis instance 1)
iperf3 -s -p 5201

# Client (Redis instance 2)
iperf3 -c 10.0.16.5 -p 5201 -t 60 -P 10

# Results (from MEMO-074 benchmarks):
[ ID] Interval Transfer Bitrate
[SUM] 0.00-60.00 sec 71.2 GBytes 10.2 Gbits/sec

# Conclusion: 10 Gbps baseline validated within placement group

Appendix D: Cross-AZ Latency Testing

Test: ping and Redis latency across AZs

# Intra-AZ (same placement group)
ping -c 100 10.0.16.5
# RTT min/avg/max = 0.15/0.25/0.45 ms

# Cross-AZ (us-west-2a → us-west-2b)
ping -c 100 10.0.32.5
# RTT min/avg/max = 0.8/1.2/2.1 ms

# Redis GET latency (intra-AZ)
redis-benchmark -h 10.0.16.5 -t get -n 100000 -q
# GET: 0.18 ms average (from MEMO-074)

# Redis GET latency (cross-AZ)
redis-benchmark -h 10.0.32.5 -t get -n 100000 -q
# GET: 1.05 ms average

# Latency penalty: 1.05 / 0.18 = 5.8× slower cross-AZ
# Validates need for placement hints to minimize cross-AZ traffic

Appendix E: Auto-Scaling Simulation

Scenario: Gradual traffic increase from 10% to 100% capacity

Time  | Load    | Instances | CPU % | Action
------|---------|-----------|-------|---------------------------
00:00 | 10% | 48 | 40% | Baseline (10B vertices)
01:00 | 20% | 48 | 75% | CPU > 70%, trigger scale-out
01:05 | 20% | 53 | 68% | Added 5 instances
02:00 | 40% | 53 | 80% | CPU > 70%, trigger scale-out
02:05 | 40% | 59 | 72% | Added 6 instances
04:00 | 80% | 106 | 75% | Gradual scaling
08:00 | 100% | 133 | 70% | Stable at target CPU

Assessment: ✅ Auto-scaling responds appropriately to load increases
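
The scale-out rule in the simulation, as a toy Go sketch (assumes CPU is inversely proportional to fleet size, which only roughly holds in the table above):

package main

import (
	"fmt"
	"math"
)

// scaleOut applies the policy above: while average CPU exceeds the 70%
// target, add 10% capacity and redistribute the load.
func scaleOut(instances int, cpu float64) (int, float64) {
	for cpu > 70 {
		grown := int(math.Ceil(float64(instances) * 1.10))
		cpu = cpu * float64(instances) / float64(grown)
		instances = grown
	}
	return instances, cpu
}

func main() {
	n, cpu := scaleOut(48, 75) // the 01:00 row in the simulation
	fmt.Printf("%d instances at %.0f%% CPU\n", n, cpu) // 53 instances at 68% CPU
}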