ADR-067: Vault PKI mTLS Architecture

Status

Proposed

Context

Prism requires secure service-to-service communication with automatic certificate management. The current implementation authenticates with JWT tokens but has no transport-level security for backend-to-backend communication.

Requirements:

  1. Service-to-service authentication via mTLS
  2. Automatic certificate rotation (7-day max lifetime)
  3. Fine-grained authorization based on certificate metadata
  4. Vault PKI integration for certificate generation
  5. Proxy integration for mTLS termination
  6. ServiceIdentity abstraction for cert-based identity extraction

Constraints:

  • Certificates must contain SANs for service name, namespace, cluster
  • AuthZ policies should use cert metadata (service name, namespace) for fine-grained access control
  • Certificate rotation must be transparent to services
  • Must work with existing ServiceIdentity and ServiceSession managers

Decision

We will implement Vault PKI integration for mTLS with the following architecture:

1. Vault PKI Secrets Engine

Enable Vault PKI to generate short-lived mTLS certificates:

# Vault PKI configuration
# Enable PKI secrets engine
vault secrets enable pki

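# Set the mount's max lease TTL high enough for the 10-year root CA generated below;
# service certs are capped at 7 days by max_ttl on the prism-services role
vault secrets tune -max-lease-ttl=87600h pki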

# Generate root CA
vault write -format=json pki/root/generate/internal \
    common_name="Prism Service CA" \
    ttl=87600h \
    | jq -r .data.certificate > /tmp/prism-ca.pem

# Configure issuing CA and CRL distribution URLs
vault write pki/config/urls \
    issuing_certificates="http://vault:8200/v1/pki/ca" \
    crl_distribution_points="http://vault:8200/v1/pki/crl"

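# Create role for service certificates
# allowed_uri_sans permits the SPIFFE URI SAN used for identity extraction
vault write pki/roles/prism-services \
    allowed_domains="prism.local" \
    allow_subdomains=true \
    allowed_uri_sans="spiffe://prism.local/*" \
    max_ttl="168h" \
    key_type="rsa" \
    key_bits="2048" \
    ou="Prism" \
    organization="Prism Data Layer"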

2. mTLS Certificate Generation

Generate certificates with embedded identity in SANs:

// Generate mTLS certificate for a service
cert, err := vaultClient.GenerateServiceCert(ctx, &GenerateCertRequest{
	Role:       "prism-services",
	CommonName: "api-gateway.prism.local",
	AltNames: &AltNames{
		DNSNames: []string{"api-gateway.prism.local"},
		IPs:      []string{},
		// SPIFFE URI SAN; read back by ExtractIdentityFromMTLSCert
		URIs:      []string{"spiffe://prism.local/api-gateway"},
		EmailSANs: []string{},
	},
	TTL:               167 * time.Hour, // Just under max
	Format:            "pem",
	CommonNameType:    "dns",
	ExcludeCNfromSans: true,
	Usages: []string{
		"digital signature",
		"key encipherment",
		"client auth",
		"server auth",
	},
})
if err != nil {
	return fmt.Errorf("generate service cert: %w", err)
}

// Certificate will contain SANs:
// DNS:api-gateway.prism.local
// URI:spiffe://prism.local/api-gateway

3. Embedded Certificate Metadata

Embed service identity in certificate fields:

// proto/prism/authz/vault.proto
message CertificateIdentity {
  string service_name = 1;
  string namespace = 2;
  string cluster = 3;
  string service_account = 4;
  string aws_account_id = 5;
  string aws_role_name = 6;
  ServiceIdentityType type = 7;
}

4. ServiceIdentity Extraction from mTLS

Extract identity from certificate:

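// pkg/authz/mtls_identity.go
func ExtractIdentityFromMTLSCert(cert *x509.Certificate) (*ServiceIdentity, error) {
	identity := &ServiceIdentity{}

	// Fallback: first DNS label of the common name (e.g. "api-gateway.prism.local").
	// Subject.Organization is not usable here because the prism-services role
	// sets it to the same value ("Prism Data Layer") on every certificate.
	if name, _, ok := strings.Cut(cert.Subject.CommonName, "."); ok {
		identity.ServiceName = name
	}

	// Preferred: SPIFFE URI SAN, which overrides the fallback
	for _, uri := range cert.URIs {
		if uri.Scheme == "spiffe" {
			// Parse: spiffe://prism.local/service-name
			// uri.Path is "/service-name", so trim the slash before splitting
			parts := strings.Split(strings.Trim(uri.Path, "/"), "/")
			if parts[0] != "" {
				identity.ServiceName = parts[0]
			}
		}
	}

	// Validate
	if !identity.IsValid() {
		return nil, fmt.Errorf("invalid service identity from certificate")
	}

	return identity, nil
}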

5. Proxy mTLS Integration

Add mTLS support to Prism proxy:

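// prism-proxy/src/mtls.rs
// Written against the rustls 0.23 API. VaultClient, CertCache, PolicyEngine,
// ServiceIdentity, Error and extract_identity_from_cert are Prism types defined elsewhere.
use rustls::pki_types::CertificateDer;
use rustls::server::danger::ClientCertVerifier;
use rustls::ServerConfig;
use std::sync::Arc;

pub struct MTLSServer {
    config: Arc<ServerConfig>,
    vault_client: Arc<VaultClient>,
    cert_cache: CertCache,
    policy_engine: PolicyEngine,
}

impl MTLSServer {
    // client_verifier must be built from the Vault PKI CA (for example with
    // rustls::server::WebPkiClientVerifier) so that clients without a valid
    // certificate are rejected during the handshake.
    pub fn new(
        vault_client: Arc<VaultClient>,
        client_verifier: Arc<dyn ClientCertVerifier>,
        policy_engine: PolicyEngine,
    ) -> Self {
        // Require and verify client certificates; without a client verifier
        // the connection would only be server-authenticated TLS.
        let mut config = ServerConfig::builder()
            .with_client_cert_verifier(client_verifier)
            .with_cert_resolver(Arc::new(CertResolver::new(vault_client.clone())));

        config.alpn_protocols = vec![b"h2".to_vec(), b"http/1.1".to_vec()];

        Self {
            config: Arc::new(config),
            vault_client,
            cert_cache: CertCache::new(),
            policy_engine,
        }
    }

    pub async fn verify_client_cert(&self, cert: &CertificateDer<'_>) -> Result<ServiceIdentity, Error> {
        // Verify cert chain against Vault PKI CA
        self.cert_cache.verify_chain(cert)?;

        // Extract identity from cert SANs
        let identity = extract_identity_from_cert(cert)?;

        // Validate against policy
        self.policy_engine.check(&identity)?;

        Ok(identity)
    }
}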

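// Certificate resolver backed by a cache that is filled from Vault.
// ResolvesServerCert::resolve is synchronous and runs inside the TLS handshake,
// so it only reads the cache; refresh() performs the Vault call from an async task.
// Written against the rustls 0.23 API.
use rustls::server::{ClientHello, ResolvesServerCert};
use rustls::sign::CertifiedKey;
use std::collections::HashMap;
use std::sync::RwLock;

struct CertResolver {
    vault_client: Arc<VaultClient>,
    certificate_cache: RwLock<HashMap<String, Arc<CertifiedKey>>>,
}

impl CertResolver {
    fn new(vault_client: Arc<VaultClient>) -> Self {
        Self { vault_client, certificate_cache: RwLock::new(HashMap::new()) }
    }

    // Fetch a certificate from Vault and cache it; call at startup and on rotation.
    // Assumes VaultClient::generate_cert returns Result<Arc<CertifiedKey>, Error>.
    async fn refresh(&self, server_name: &str) -> Result<(), Error> {
        let cert = self.vault_client.generate_cert(server_name).await?;
        self.certificate_cache
            .write()
            .unwrap()
            .insert(server_name.to_owned(), cert);
        Ok(())
    }
}

// rustls requires resolvers to implement Debug
impl std::fmt::Debug for CertResolver {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        f.debug_struct("CertResolver").finish_non_exhaustive()
    }
}

impl ResolvesServerCert for CertResolver {
    fn resolve(&self, client_hello: ClientHello<'_>) -> Option<Arc<CertifiedKey>> {
        let server_name = client_hello.server_name()?;

        // Returning None aborts the handshake when no certificate is cached yet
        let cache = self.certificate_cache.read().unwrap();
        cache.get(server_name).cloned()
    }
}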

6. Automatic Certificate Rotation

Rotate certs before expiration:

// pkg/authz/mtls_rotation.go

// certTTL is the requested certificate lifetime, just under the role's 168h max.
const certTTL = 167 * time.Hour

type MTLSCertManager struct {
	vaultClient *VaultClient
	certCache   map[string]*TLSCertificate // must be initialized with make() before use
	mu          sync.RWMutex
}

func (m *MTLSCertManager) StartRotation(ctx context.Context, service string) {
	go func() {
		// Issue the first cert immediately instead of waiting for the first tick
		m.rotateIfNecessary(ctx, service)

		ticker := time.NewTicker(1 * time.Hour)
		defer ticker.Stop()

		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				m.rotateIfNecessary(ctx, service)
			}
		}
	}()
}

func (m *MTLSCertManager) rotateIfNecessary(ctx context.Context, service string) {
	m.mu.RLock()
	cert, exists := m.certCache[service]
	m.mu.RUnlock()

	if exists {
		// Rotate once half the lifetime has elapsed (about 3.5 days before expiry).
		// Subtracting the full 7 days would trigger a rotation on every tick.
		rotationTime := cert.NotAfter.Add(-certTTL / 2)
		if time.Now().Before(rotationTime) {
			return
		}
		log.Printf("Rotating cert for %s (expires: %v)", service, cert.NotAfter)
	}

	// On failure keep the cached cert; the next tick retries
	if err := m.generateCert(ctx, service); err != nil {
		log.Printf("Cert generation failed for %s: %v", service, err)
	}
}

func (m *MTLSCertManager) generateCert(ctx context.Context, service string) error {
	cert, err := m.vaultClient.GenerateServiceCert(ctx, &GenerateCertRequest{
		Role:       "prism-services",
		CommonName: service + ".prism.local",
		AltNames:   &AltNames{DNSNames: []string{service + ".prism.local"}},
		TTL:        certTTL,
		KeyType:    "rsa",
		KeyBits:    2048,
	})
	if err != nil {
		return err
	}

	m.mu.Lock()
	m.certCache[service] = cert
	m.mu.Unlock()

	return nil
}

Rationale

Why Vault PKI?

Alternatives Considered:

  1. Manual Cert Management (Rejected)

    • Pros: Simple, no external dependencies
    • Cons: ❌ No automatic rotation, ❌ Error-prone manual renewal, ❌ Does not scale
    • Verdict: Not feasible for production
  2. Let's Encrypt (Rejected)

    • Pros: Free, automated
    • Cons: ❌ Not designed for internal service certs, ❌ Rate limits, ❌ Public CA
    • Verdict: Wrong use case
  3. HashiCorp Vault PKI (Selected)

    • Pros: ✅ Internal PKI, ✅ Automatic rotation, ✅ Fine-grained policies, ✅ Audit logging
    • Cons: ⚠️ Additional dependency
    • Verdict: Best fit for internal service mTLS
  4. SPIFFE/SPIRE (Rejected)

    • Pros: Industry standard for workload identity (SPIFFE)
    • Cons: ⚠️ More complex setup, ⚠️ Additional infrastructure
    • Verdict: Good but Vault PKI is simpler for Prism's needs

Architecture Decisions

1. Certificate Lifetimes (7 days)

  • Short enough for security, long enough to reduce rotation overhead
  • Renewal happens 3.5 days before expiry (50% of lifetime)
  • Grace period for network issues

2. ServiceIdentity Extraction from Cert

  • Use SANs for service name, namespace, cluster
  • URI SANs for SPIFFE-style addressing
  • Avoids token passing, relying on cert verification

3. Proxy-Centric Rotation

  • Proxy manages certs, not individual services
  • Services connect via proxy, don't need to handle rotation
  • Centralized cert management

4. Vault as Root CA (No Intermediate)

  • Direct Vault PKI integration
  • Simpler than CA chain for initial implementation
  • Can add intermediate CA later if needed

Consequences

Positive

  • Automatic Rotation: Certificates are reissued automatically at 50% of their 7-day lifetime
  • Auditable: All cert requests logged in Vault
  • Secure: mTLS provides transport-level security
  • Scalable: Vault issues certs on demand via its API (issuance throughput to be benchmarked for Prism's workload)
  • Identity Binding: Certificate contains service identity in SANs
  • Fine-grained AuthZ: Can use cert metadata for per-service policies

Negative

  • Additional Dependency: Must run Vault in production
  • Latency: Cert generation adds ~100ms to first request
  • Complexity: More moving parts to debug
  • Single Point of Failure: Vault PKI must be available

Mitigations

  • Cache certificates in proxy
  • Health checks for Vault connectivity
  • Fallback to cached certs if Vault unavailable
  • Graceful degradation with warning logs

Implementation Notes

Migration Steps

Phase 1: Setup Vault PKI (Week 1)

  1. Enable PKI secrets engine
  2. Configure root CA
  3. Create service roles
  4. Test basic cert generation

Phase 2: Proxy Integration (Week 2)

  1. Implement mTLS server in proxy
  2. Add cert rotation logic
  3. Implement identity extraction
  4. Add integration tests

Phase 3: Backend Integration (Week 3)

  1. Update backend drivers to accept mTLS
  2. Implement client cert verification
  3. Add test coverage
  4. Performance tuning

Phase 4: Production Rollout (Week 4)

  1. Deploy to staging
  2. Monitor cert rotation
  3. Gradual rollout
  4. Decommission old auth methods

Key Gotchas

  1. Clock Skew: Ensure all hosts synchronize time via NTP; skewed clocks cause freshly issued certs to be rejected as not yet valid
  2. DNS Resolution: Services must resolve Vault PKI domain
  3. Lease Management: Certificate leases must be revoked on shutdown
  4. Certificate Chain: Must verify full chain, not just leaf cert

Testing Strategy

// tests/integration/mtls_test.go
func TestMTLS_CertRotation(t *testing.T) {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
	defer cancel()

	// Start Vault with PKI
	vault := startVault(ctx, t)
	defer vault.Stop(ctx)

	// Start proxy with mTLS. Short-lived certs keep the test inside the
	// 10-minute timeout instead of sleeping for real days.
	// CertTTL and RotationInterval are assumed proxyConfig fields.
	proxy := startProxy(ctx, t, proxyConfig{
		MTLSEnabled:      true,
		VaultAddr:        vault.Address(),
		CertTTL:          20 * time.Second,
		RotationInterval: time.Second,
	})
	defer proxy.Stop()

	prevCert := proxy.GetCurrentCert()

	// Expect three rotations, each at roughly 50% of the cert lifetime
	for rotation := 1; rotation <= 3; rotation++ {
		require.Eventually(t, func() bool {
			return !assert.ObjectsAreEqual(prevCert, proxy.GetCurrentCert())
		}, 30*time.Second, time.Second, "cert did not rotate (rotation %d)", rotation)

		prevCert = proxy.GetCurrentCert()
	}
}

Revision History

  • 2026-04-19: Initial draft - Vault PKI mTLS architecture