# RFC-012: Prism Network Gateway (prism-netgw) - Multi-Region Control Plane
## Abstract
This RFC proposes prism-netgw, a distributed control plane for managing collections of Prism data gateway clusters across multiple cloud providers, regions, and on-premises environments. prism-netgw handles cluster registration, configuration synchronization, health monitoring, and cross-region routing while tolerating high latency and network partitions.
## Motivation
### Problem Statement
Organizations deploying Prism at scale face several challenges:
- Multi-Region Deployments: Prism gateways are deployed across AWS, GCP, Azure, and on-prem
- Configuration Management: Keeping namespace configs, backend definitions, and policies synchronized
- Cross-Region Discovery: Applications need to discover the nearest Prism gateway
- Health Monitoring: Centralized visibility into all Prism instances
- High Latency Tolerance: Cross-region communication experiences 100-500ms latency
- Network Partitions: Cloud VPCs and on-prem networks may have intermittent connectivity
### Goals
- Cluster Management: Register, configure, and monitor Prism gateway clusters
- Configuration Sync: Distribute namespace and backend configs across regions
- Service Discovery: Enable clients to discover the nearest healthy Prism gateway
- Health Aggregation: Collect health and metrics from all clusters
- Latency Tolerance: Operate correctly with 100-500ms cross-region latency
- Partition Tolerance: Handle network partitions gracefully
- Multi-Cloud: Support AWS, GCP, Azure, on-prem deployments
### Non-Goals
- Not a data plane: prism-netgw does NOT proxy data requests (Prism gateways handle that)
- Not a service mesh: Use dedicated service mesh (Istio, Linkerd) for data plane networking
- Not a config database: Uses etcd/Consul for distributed storage
## Architecture
### High-Level Design
┌─────────────────────────────────────────────────────────────────┐
│                    prism-netgw Control Plane                    │
│                 (Raft consensus, multi-region)                  │
└─────────────────────────────────────────────────────────────────┘
                                 │
                 ┌───────────────┼───────────────┐
                 │               │               │
         ┌───────▼──────┐   ┌────▼─────┐  ┌──────▼──────┐
         │ AWS Region   │   │ GCP Zone │  │ On-Prem DC  │
         │ us-east-1    │   │ us-cent1 │  │ Seattle     │
         └──────────────┘   └──────────┘  └─────────────┘
                 │               │               │
         ┌───────▼──────┐   ┌────▼─────┐  ┌──────▼──────┐
         │ Prism Cluster│   │ Prism    │  │ Prism       │
         │ (3 nodes)    │   │ Cluster  │  │ Cluster     │
         └──────────────┘   └──────────┘  └─────────────┘
                 │               │               │
         ┌───────▼──────┐   ┌────▼─────┐  ┌──────▼──────┐
         │ Backends     │   │ Backends │  │ Backends    │
         │ (Postgres,   │   │ (Kafka,  │  │ (SQLite,    │
         │  Redis)      │   │  NATS)   │  │  Postgres)  │
         └──────────────┘   └──────────┘  └─────────────┘
### Components
graph TB
subgraph "prism-netgw Control Plane"
API[Control Plane API<br/>:9980]
Raft[Raft Consensus<br/>Multi-region]
Store[Distributed Store<br/>etcd/Consul]
Monitor[Health Monitor<br/>Polling]
Sync[Config Sync<br/>Push/Pull]
Discovery[Service Discovery<br/>DNS/gRPC]
end
subgraph "Prism Gateway Cluster (us-east-1)"
Agent1[prism-agent<br/>:9981]
Prism1[Prism Gateway 1]
Prism2[Prism Gateway 2]
Prism3[Prism Gateway 3]
end
subgraph "Prism Gateway Cluster (eu-west-1)"
Agent2[prism-agent<br/>:9981]
Prism4[Prism Gateway 4]
Prism5[Prism Gateway 5]
end
API --> Raft
Raft --> Store
Monitor --> Agent1
Monitor --> Agent2
Sync --> Agent1
Sync --> Agent2
Agent1 --> Prism1
Agent1 --> Prism2
Agent1 --> Prism3
Agent2 --> Prism4
Agent2 --> Prism5
Discovery -.->|Returns nearest| Prism1
Discovery -.->|Returns nearest| Prism4
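The Discovery component in the diagram returns the nearest gateway to a client. A minimal sketch of one way that selection could work, assuming the control plane tracks a health flag and a measured round-trip latency per cluster. The `ClusterInfo` type, its fields, and `pick_nearest` are hypothetical names for illustration, not part of the prism-netgw API:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ClusterInfo:
    """Hypothetical discovery record; field names are illustrative."""
    cluster_id: str
    region: str
    healthy: bool
    rtt_ms: float  # assumed: round-trip latency measured from the client's region


def pick_nearest(clusters: list[ClusterInfo]) -> Optional[ClusterInfo]:
    """Return the healthy cluster with the lowest latency, or None if none is healthy."""
    healthy = [c for c in clusters if c.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda c: c.rtt_ms)


clusters = [
    ClusterInfo("aws-us-east-1-prod", "us-east-1", healthy=True, rtt_ms=12.0),
    ClusterInfo("aws-eu-west-1-prod", "eu-west-1", healthy=True, rtt_ms=95.0),
]
nearest = pick_nearest(clusters)
```

Filtering on health before ranking by latency means an unhealthy but close cluster is never returned; a real implementation might instead rank by geography or labels.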
### Deployment Model
┌─────────────────────────────────────────────────────────────────┐
│                      Global Control Plane                       │
│                                                                 │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐       │
│  │ netgw-leader │───▶│ netgw-node2  │◀──▶│ netgw-node3  │       │
│  │ (us-east-1)  │    │ (eu-west-1)  │    │ (ap-south-1) │       │
│  └──────────────┘    └──────────────┘    └──────────────┘       │
│         │                    │                    │             │
│         │   Raft consensus   │                    │             │
│         └────────────────────┴────────────────────┘             │
└─────────────────────────────────────────────────────────────────┘
Deployment Options:
- Multi-Region Active-Standby: 1 leader, N followers in different regions
- Multi-Region Active-Active: Raft quorum across regions (requires low latency between control plane nodes)
- Federated: Independent control planes per region, manual config sync
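For the Raft-based options, the number of control plane nodes determines how many regional failures or partitions the cluster can survive. Raft commits a write only once a majority of nodes acknowledge it, so the arithmetic is short. The helper names below are illustrative:

```python
def quorum_size(nodes: int) -> int:
    """Smallest majority of a Raft cluster with `nodes` voting members."""
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many nodes can be down or partitioned away while writes still commit."""
    return nodes - quorum_size(nodes)


# The three-node deployment pictured above (us-east-1, eu-west-1, ap-south-1).
three_node = (quorum_size(3), tolerated_failures(3))
# A five-node deployment for comparison.
five_node = (quorum_size(5), tolerated_failures(5))
```

With three nodes, losing any one region keeps the control plane writable. A node cut off from the majority, including a leader, cannot commit new configuration until it can reach a quorum again.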
## Core Concepts
### 1. Cluster Registration
Prism gateway clusters register with prism-netgw:
syntax = "proto3";
package prism.netgw.v1;

import "google/protobuf/timestamp.proto";  // needed for expires_at below

message RegisterClusterRequest {
  string cluster_id = 1;           // Unique cluster identifier (e.g., "aws-us-east-1-prod")
  string region = 2;               // Cloud region (e.g., "us-east-1")
  string cloud_provider = 3;       // "aws", "gcp", "azure", "on-prem"
  string vpc_id = 4;               // VPC or network identifier
  repeated string endpoints = 5;   // gRPC endpoints for Prism gateways
  map<string, string> labels = 6;  // Arbitrary labels (e.g., "env": "prod")
}

message RegisterClusterResponse {
  string cluster_id = 1;
  int64 registration_version = 2;            // Version for optimistic concurrency
  google.protobuf.Timestamp expires_at = 3;  // TTL for heartbeat
}

// The UnregisterCluster* and Heartbeat* messages are not shown in this section.
service ControlPlaneService {
  rpc RegisterCluster(RegisterClusterRequest) returns (RegisterClusterResponse);
  rpc UnregisterCluster(UnregisterClusterRequest) returns (UnregisterClusterResponse);
  rpc Heartbeat(HeartbeatRequest) returns (HeartbeatResponse);
}
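To illustrate how `registration_version` and `expires_at` might interact, here is an in-memory sketch of the server-side bookkeeping. It assumes a 30-second TTL, that each re-registration of a live cluster bumps the version, and that a heartbeat simply extends the expiry; none of these are specified above, and `ClusterRegistry` and the example endpoint are hypothetical names.

```python
import time
from dataclasses import dataclass


@dataclass
class Registration:
    cluster_id: str
    region: str
    endpoints: list[str]
    registration_version: int
    expires_at: float  # Unix seconds; stands in for google.protobuf.Timestamp


class ClusterRegistry:
    """In-memory stand-in for the control plane's registration state."""

    def __init__(self, ttl_seconds: float = 30.0):  # TTL value is an assumption
        self.ttl_seconds = ttl_seconds
        self._clusters: dict[str, Registration] = {}

    def register(self, cluster_id, region, endpoints, now=None) -> Registration:
        now = time.time() if now is None else now
        previous = self._clusters.get(cluster_id)
        # Assumed rule: re-registering an existing cluster increments its version.
        version = previous.registration_version + 1 if previous else 1
        reg = Registration(cluster_id, region, list(endpoints), version,
                           now + self.ttl_seconds)
        self._clusters[cluster_id] = reg
        return reg

    def heartbeat(self, cluster_id, now=None) -> Registration:
        now = time.time() if now is None else now
        reg = self._clusters.get(cluster_id)
        if reg is None or reg.expires_at <= now:
            # Expired entries are dropped; the cluster must register again.
            self._clusters.pop(cluster_id, None)
            raise KeyError(f"{cluster_id} is not registered or has expired")
        reg.expires_at = now + self.ttl_seconds
        return reg

    def live_clusters(self, now=None) -> list[str]:
        now = time.time() if now is None else now
        return sorted(cid for cid, r in self._clusters.items() if r.expires_at > now)


registry = ClusterRegistry(ttl_seconds=30.0)
first = registry.register("aws-us-east-1-prod", "us-east-1",
                          ["gw-1.example.internal:443"], now=1000.0)
renewed = registry.heartbeat("aws-us-east-1-prod", now=1020.0)
```

Passing `now` explicitly keeps the sketch deterministic; a production registry would persist this state in the distributed store rather than in process memory.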
### 2. Configuration Synchronization
Problem: Namespace and backend configs must be consistent across all clusters.
Solution: Version-controlled config distribution with eventual consistency.
message SyncConfigRequest {
  string cluster_id = 1;
  int64 current_version = 2;  // Cluster's current config version
}

message SyncConfigResponse {
  int64 latest_version = 1;
  repeated NamespaceConfig namespaces = 2;
  repeated BackendConfig backends = 3;
  repeated Policy policies = 4;

  // Incremental updates if current_version is recent
  bool is_incremental = 10;
  repeated ConfigChange changes = 11;  // Only deltas since current_version
}

message ConfigChange {
  enum ChangeType {
    ADDED = 0;
    MODIFIED = 1;
    DELETED = 2;
  }
  ChangeType type = 1;
  string resource_type = 2;  // "namespace", "backend", "policy"
  string resource_id = 3;
  bytes resource_data = 4;   // Protobuf-encoded resource
}
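The choice between an incremental and a full response can be sketched on the control plane side as follows. This assumes the server keeps a bounded tail of recent changes next to the latest snapshot and falls back to a full snapshot when a cluster's `current_version` predates that tail. `ConfigLog`, the retention limit, the per-change `version` field, and the flat `snapshot` dict (standing in for the separate namespace, backend, and policy lists) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigChange:
    version: int        # assumed: config version at which the change was committed
    change_type: str    # "ADDED", "MODIFIED", or "DELETED"
    resource_type: str  # "namespace", "backend", "policy"
    resource_id: str
    resource_data: bytes = b""


class ConfigLog:
    """Keeps the latest snapshot plus a bounded tail of recent changes."""

    def __init__(self, max_retained: int = 1000):  # retention limit is an assumption
        self.max_retained = max_retained
        self.latest_version = 0
        self.snapshot: dict[tuple[str, str], bytes] = {}
        self.changes: list[ConfigChange] = []

    def commit(self, change_type, resource_type, resource_id, data=b"") -> int:
        self.latest_version += 1
        key = (resource_type, resource_id)
        if change_type == "DELETED":
            self.snapshot.pop(key, None)
        else:
            self.snapshot[key] = data
        self.changes.append(ConfigChange(self.latest_version, change_type,
                                         resource_type, resource_id, data))
        self.changes = self.changes[-self.max_retained:]
        return self.latest_version

    def sync(self, current_version: int) -> dict:
        """Build a SyncConfigResponse-like dict for a cluster at current_version."""
        oldest = self.changes[0].version if self.changes else self.latest_version + 1
        # Incremental only if every change after current_version is still retained.
        recent_enough = current_version + 1 >= oldest
        if 0 < current_version <= self.latest_version and recent_enough:
            deltas = [c for c in self.changes if c.version > current_version]
            return {"latest_version": self.latest_version,
                    "is_incremental": True, "changes": deltas}
        return {"latest_version": self.latest_version,
                "is_incremental": False, "snapshot": dict(self.snapshot)}


log = ConfigLog(max_retained=2)
log.commit("ADDED", "namespace", "orders", b"v1")
log.commit("ADDED", "backend", "pg-main", b"v1")
log.commit("MODIFIED", "namespace", "orders", b"v2")
recent = log.sync(current_version=2)  # within the retained tail
fresh = log.sync(current_version=0)   # new cluster with no config yet
```

Bounding the change log trades memory for bandwidth: clusters that fall far behind, for example after a long partition, pay for one full snapshot instead of forcing the server to retain unbounded history.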
**Push Model** (preferred):
prism-netgw → Watch(config_version) → prism-agent
            ← ConfigUpdate stream ←

**Pull Model** (fallback for high latency):
prism-agent → SyncConfig(current_version) → prism-netgw
            ← SyncConfigResponse ←
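Whichever model delivers an update, the agent should apply it so that duplicates are harmless, since a reconnecting stream or a repeated poll can redeliver a version the agent already has. A minimal sketch of such an idempotent apply step, using plain dicts in place of the generated protobuf classes. The dict shapes and the `apply_sync_response` name are assumptions, not the agent's actual interface:

```python
def apply_sync_response(state: dict, response: dict) -> dict:
    """Return the agent state after applying a SyncConfigResponse-like dict.

    `state` has the shape
    {"version": int, "resources": {(resource_type, resource_id): bytes}}.
    """
    if response["latest_version"] <= state["version"]:
        return state  # stale or duplicate delivery; nothing to do

    if not response["is_incremental"]:
        # A full snapshot replaces everything the agent currently holds.
        resources = dict(response["resources"])
    else:
        resources = dict(state["resources"])
        for change in response["changes"]:
            key = (change["resource_type"], change["resource_id"])
            if change["type"] == "DELETED":
                resources.pop(key, None)
            else:  # ADDED and MODIFIED are both treated as upserts
                resources[key] = change["resource_data"]

    return {"version": response["latest_version"], "resources": resources}


state = {"version": 1, "resources": {("namespace", "orders"): b"v1"}}
update = {
    "latest_version": 3,
    "is_incremental": True,
    "changes": [
        {"type": "ADDED", "resource_type": "backend",
         "resource_id": "pg-main", "resource_data": b"v1"},
        {"type": "DELETED", "resource_type": "namespace",
         "resource_id": "orders", "resource_data": b""},
    ],
}
state = apply_sync_response(state, update)
replayed = apply_sync_response(state, update)  # second delivery is a no-op
```

Because the function compares versions before touching any resources, the same code path can serve both the push stream and the pull fallback.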