ADR-067: Vault PKI mTLS Architecture
Status
Proposed
Context
Prism requires secure service-to-service communication with automatic certificate management. The current implementation authenticates requests with JWT tokens but provides no transport-level security for backend-to-backend traffic.
Requirements:
- Service-to-service authentication via mTLS
- Automatic certificate rotation (7-day max lifetime)
- Fine-grained authorization based on certificate metadata
- Vault PKI integration for certificate generation
- Proxy integration for mTLS termination
- ServiceIdentity abstraction for cert-based identity extraction
Constraints:
- Certificates must contain SANs for service name, namespace, cluster
- AuthZ policies should use cert metadata (service name, namespace) for fine-grained access control
- Certificate rotation must be transparent to services
- Must work with existing ServiceIdentity and ServiceSession managers
Decision
We will implement Vault PKI integration for mTLS with the following architecture:
1. Vault PKI Secrets Engine
Enable Vault PKI to generate short-lived mTLS certificates:
# Vault PKI configuration
# Enable PKI secrets engine
vault secrets enable pki
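# Raise the mount's max lease TTL so the 10-year root CA is not capped.
# The 7-day limit for service certs is enforced by max_ttl on the role.
vault secrets tune -max-lease-ttl=87600h pki
# Generate root CA (common_name is the required parameter for the CA subject)
vault write -format=json pki/root/generate/internal \
common_name="Prism Service CA" \
ttl=87600h \
| jq -r .data.certificate > /tmp/prism-ca.pem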
# Configure issuing certificate and CRL distribution URLs
vault write pki/config/urls \
issuing_certificates="http://vault:8200/v1/pki/ca" \
crl_distribution_points="http://vault:8200/v1/pki/crl"
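# Create role for service certificates
# allowed_uri_sans permits the SPIFFE URI SAN that carries the service identity
vault write pki/roles/prism-services \
allowed_domains="prism.local" \
allow_subdomains=true \
allowed_uri_sans="spiffe://prism.local/*" \
max_ttl="168h" \
key_type="rsa" \
key_bits="2048" \
ou="Prism" \
organization="Prism Data Layer"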
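To make the effect of `allowed_domains` and `allow_subdomains` concrete, the following Go sketch approximates the name check the role applies. It is an illustration only: `allowedByRole` is a hypothetical helper, it assumes `allow_bare_domains` is left at its default of false, and it ignores wildcard and glob handling that Vault also supports.

```go
package main

import (
	"fmt"
	"strings"
)

// allowedByRole approximates the role's DNS name check for a single
// allowed domain. Hypothetical helper, not part of Vault's API.
func allowedByRole(name, allowedDomain string, allowSubdomains bool) bool {
	name = strings.ToLower(strings.TrimSuffix(name, "."))
	if name == allowedDomain {
		// The bare domain needs allow_bare_domains=true, which the role does not set.
		return false
	}
	// A subdomain must end in ".<allowedDomain>", so "evilprism.local" does not match.
	return allowSubdomains && strings.HasSuffix(name, "."+allowedDomain)
}

func main() {
	names := []string{"api-gateway.prism.local", "prism.local", "evilprism.local"}
	for _, n := range names {
		fmt.Printf("%-26s allowed=%v\n", n, allowedByRole(n, "prism.local", true))
	}
}
```

Under these assumptions only `api-gateway.prism.local` passes, which is why every service common name in this ADR is a subdomain of `prism.local`.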
2. mTLS Certificate Generation
Generate certificates with embedded identity in SANs:
// Generate mTLS certificate for a service
cert, err := vaultClient.GenerateServiceCert(ctx, &GenerateCertRequest{
	Role:       "prism-services",
	CommonName: "api-gateway.prism.local",
	AltNames: &AltNames{
		DNSNames: []string{"api-gateway.prism.local"},
		IPs:      []string{},
		// The SPIFFE ID must be requested explicitly, and the role must
		// permit it through allowed_uri_sans
		URIs:      []string{"spiffe://prism.local/api-gateway"},
		EmailSANs: []string{},
	},
	TTL:               167 * time.Hour, // Just under the role's 168h max_ttl
	Format:            "pem",
	CommonNameType:    "dns",
	ExcludeCNfromSans: true, // The CN is already listed in DNSNames
	Usages: []string{
		"digital signature",
		"key encipherment",
		"client auth",
		"server auth",
	},
})
if err != nil {
	return err
}
// Certificate will contain SANs:
// DNS:api-gateway.prism.local
// URI:spiffe://prism.local/api-gateway
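The URI SAN listed above follows the `spiffe://<trust-domain>/<service>` shape. A minimal sketch of building it with the standard library is shown here; `spiffeID` is a hypothetical helper name, and the single-segment path is an assumption taken from the example value in this ADR.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// spiffeID builds spiffe://<trustDomain>/<service>.
// Hypothetical helper; the one-segment path layout is an assumption.
func spiffeID(trustDomain, service string) (*url.URL, error) {
	if trustDomain == "" || service == "" {
		return nil, fmt.Errorf("trust domain and service must be non-empty")
	}
	if strings.Contains(service, "/") {
		return nil, fmt.Errorf("service %q must be a single path segment", service)
	}
	return &url.URL{Scheme: "spiffe", Host: trustDomain, Path: "/" + service}, nil
}

func main() {
	id, err := spiffeID("prism.local", "api-gateway")
	if err != nil {
		panic(err)
	}
	fmt.Println(id.String()) // spiffe://prism.local/api-gateway
}
```

Building the value from parts rather than by string concatenation keeps the trust domain and the service name separately validated before they reach the certificate request.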
3. Embedded Certificate Metadata
Embed service identity in certificate fields:
// proto/prism/authz/vault.proto
message CertificateIdentity {
string service_name = 1;
string namespace = 2;
string cluster = 3;
string service_account = 4;
string aws_account_id = 5;
string aws_role_name = 6;
ServiceIdentityType type = 7;
}
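On the Go side, the identity fields need a validity rule before they are trusted for authorization. The sketch below is one possible rule, not the project's actual one: `CertIdentity` is a hypothetical mirror of a subset of the message above, and the choice to require a DNS-label service name while treating namespace and cluster as optional is an assumption.

```go
package main

import (
	"fmt"
	"regexp"
)

// dnsLabel matches one lowercase DNS label of 1 to 63 characters.
var dnsLabel = regexp.MustCompile(`^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$`)

// CertIdentity is a hypothetical Go mirror of part of CertificateIdentity.
type CertIdentity struct {
	ServiceName string
	Namespace   string
	Cluster     string
}

// IsValid requires a well-formed service name; namespace and cluster
// are optional but must be well-formed when present (assumed rule).
func (id *CertIdentity) IsValid() bool {
	if id == nil || !dnsLabel.MatchString(id.ServiceName) {
		return false
	}
	for _, optional := range []string{id.Namespace, id.Cluster} {
		if optional != "" && !dnsLabel.MatchString(optional) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println((&CertIdentity{ServiceName: "api-gateway"}).IsValid())
	fmt.Println((&CertIdentity{ServiceName: "api-gateway", Namespace: "Bad_NS"}).IsValid())
	fmt.Println((&CertIdentity{}).IsValid())
}
```

Restricting every field to DNS labels keeps identity values safe to embed in hostnames, SPIFFE paths, and policy keys without further escaping.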
4. ServiceIdentity Extraction from mTLS
Extract identity from certificate:
// pkg/authz/mtls_identity.go
func ExtractIdentityFromMTLSCert(cert *x509.Certificate) (*ServiceIdentity, error) {
	identity := &ServiceIdentity{}
	// Fallback: derive the service name from the Subject CN
	// ("api-gateway.prism.local" -> "api-gateway"). Subject.Organization is
	// the role-wide value "Prism Data Layer", not a per-service name.
	identity.ServiceName = strings.TrimSuffix(cert.Subject.CommonName, ".prism.local")
	// Prefer the SPIFFE URI SAN when present
	for _, uri := range cert.URIs {
		if uri.Scheme == "spiffe" && uri.Host == "prism.local" {
			// Parse: spiffe://prism.local/service-name
			// uri.Path is "/service-name", which splits into ["", "service-name"]
			parts := strings.Split(uri.Path, "/")
			if len(parts) >= 2 && parts[1] != "" {
				identity.ServiceName = parts[1]
			}
		}
	}
	// Validate
	if !identity.IsValid() {
		return nil, fmt.Errorf("invalid service identity from certificate")
	}
	return identity, nil
}
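The SPIFFE path handling is easy to get wrong by one index, so it helps to exercise it in isolation. In this sketch, `serviceFromSPIFFE` and `serviceFromSPIFFEString` are hypothetical helpers, and rejecting any path with more than one segment is an assumption based on the `spiffe://prism.local/service-name` shape used in this ADR.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// serviceFromSPIFFE returns the service name from spiffe://<trustDomain>/<service>.
func serviceFromSPIFFE(u *url.URL, trustDomain string) (string, error) {
	if u == nil || u.Scheme != "spiffe" {
		return "", fmt.Errorf("not a SPIFFE URI")
	}
	if u.Host != trustDomain {
		return "", fmt.Errorf("unexpected trust domain %q", u.Host)
	}
	// A path of "/api-gateway" splits into ["", "api-gateway"]:
	// index 0 is the empty string before the leading slash.
	parts := strings.Split(u.Path, "/")
	if len(parts) != 2 || parts[1] == "" {
		return "", fmt.Errorf("expected exactly one path segment, got %q", u.Path)
	}
	return parts[1], nil
}

// serviceFromSPIFFEString parses a raw URI first. Entries in
// x509.Certificate.URIs are already *url.URL and can skip this step.
func serviceFromSPIFFEString(raw, trustDomain string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	return serviceFromSPIFFE(u, trustDomain)
}

func main() {
	name, err := serviceFromSPIFFEString("spiffe://prism.local/api-gateway", "prism.local")
	fmt.Println(name, err) // api-gateway <nil>
}
```

Checking the trust domain as well as the scheme prevents a certificate carrying a SPIFFE ID from a different domain from being mapped onto a Prism service name.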
5. Proxy mTLS Integration
Add mTLS support to Prism proxy:
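// prism-proxy/src/mtls.rs
// Sketch written against the rustls 0.23 API
use rustls::pki_types::CertificateDer;
use rustls::server::{ClientHello, ResolvesServerCert, WebPkiClientVerifier};
use rustls::sign::CertifiedKey;
use rustls::{RootCertStore, ServerConfig};
use std::sync::Arc;

pub struct MTLSServer {
    config: Arc<ServerConfig>,
    vault_client: Arc<VaultClient>,
    policy_engine: Arc<PolicyEngine>,
    cert_cache: Arc<CertCache>,
}

impl MTLSServer {
    pub fn new(
        vault_client: Arc<VaultClient>,
        policy_engine: Arc<PolicyEngine>,
        ca_roots: RootCertStore,
    ) -> Result<Self, Error> {
        // Require client certificates and verify them against the Vault PKI CA.
        // with_no_client_auth() would give server-only TLS, not mTLS.
        let verifier = WebPkiClientVerifier::builder(Arc::new(ca_roots)).build()?;
        let cert_cache = Arc::new(CertCache::new());
        let mut config = ServerConfig::builder()
            .with_client_cert_verifier(verifier)
            .with_cert_resolver(Arc::new(CertResolver { cert_cache: cert_cache.clone() }));
        config.alpn_protocols = vec!["h2".into(), "http/1.1".into()];
        Ok(Self {
            config: Arc::new(config),
            vault_client,
            policy_engine,
            cert_cache,
        })
    }

    pub fn verify_client_cert(&self, cert: &CertificateDer<'_>) -> Result<ServiceIdentity, Error> {
        // The chain was already verified against the Vault PKI CA during the
        // handshake by the client cert verifier configured in new()
        // Extract identity from cert SANs
        let identity = extract_identity_from_cert(cert)?;
        // Validate against policy
        self.policy_engine.check(&identity)?;
        Ok(identity)
    }

    // Issue (or reissue) a cert from Vault and publish it to the cache.
    // Call at startup and from the rotation task, never from the handshake path.
    pub async fn refresh_cert(&self, server_name: &str) -> Result<(), Error> {
        let cert = self.vault_client.generate_cert(server_name).await?;
        self.cert_cache.insert(server_name.to_owned(), cert);
        Ok(())
    }
}

// Certificate resolver that serves certs previously fetched from Vault.
// resolve() is synchronous and runs inside the TLS handshake, so it cannot
// await a Vault call; it only reads the cache.
#[derive(Debug)]
struct CertResolver {
    cert_cache: Arc<CertCache>,
}

impl ResolvesServerCert for CertResolver {
    fn resolve(&self, client_hello: ClientHello<'_>) -> Option<Arc<CertifiedKey>> {
        let server_name = client_hello.server_name()?;
        // Returning None aborts the handshake if no cert has been issued yet
        self.cert_cache.get(server_name)
    }
}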
6. Automatic Certificate Rotation
Rotate certs before expiration:
// pkg/authz/mtls_rotation.go
// certTTL is the requested lifetime, just under the role's 168h max_ttl
const certTTL = 167 * time.Hour

type MTLSCertManager struct {
	vaultClient *VaultClient
	certCache   map[string]*TLSCertificate
	mu          sync.RWMutex
}

func (m *MTLSCertManager) StartRotation(ctx context.Context, service string) {
	go func() {
		ticker := time.NewTicker(1 * time.Hour)
		defer ticker.Stop()
		for {
			// Check once at startup, then on every tick
			m.rotateIfNecessary(ctx, service)
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
			}
		}
	}()
}

func (m *MTLSCertManager) rotateIfNecessary(ctx context.Context, service string) {
	m.mu.RLock()
	cert, exists := m.certCache[service]
	m.mu.RUnlock()
	// Rotate at 50% of the lifetime. A fixed 7-day offset would exceed the
	// 167h TTL and trigger a reissue on every tick.
	if exists && time.Now().Before(cert.NotAfter.Add(-certTTL/2)) {
		return
	}
	if exists {
		log.Printf("Rotating cert for %s (expires: %v)", service, cert.NotAfter)
	}
	if err := m.generateCert(ctx, service); err != nil {
		// Keep serving the cached cert; the next tick retries
		log.Printf("Cert issuance for %s failed: %v", service, err)
	}
}

func (m *MTLSCertManager) generateCert(ctx context.Context, service string) error {
	cert, err := m.vaultClient.GenerateServiceCert(ctx, &GenerateCertRequest{
		Role:       "prism-services",
		CommonName: service + ".prism.local",
		AltNames:   &AltNames{DNSNames: []string{service + ".prism.local"}},
		TTL:        certTTL,
		KeyType:    "rsa",
		KeyBits:    2048,
	})
	if err != nil {
		return err
	}
	m.mu.Lock()
	m.certCache[service] = cert
	m.mu.Unlock()
	return nil
}
Rationale
Why Vault PKI?
Alternatives Considered:
- Manual Cert Management (Rejected)
  - Pros: Simple, no external dependencies
  - Cons: ❌ No automatic rotation, ❌ Error-prone manual renewal, ❌ Does not scale
  - Verdict: Not feasible for production
- Let's Encrypt (Rejected)
  - Pros: Free, automated
  - Cons: ❌ Cannot issue certs for internal service names, ❌ Rate limits, ❌ Public CA
  - Verdict: Wrong use case
- HashiCorp Vault PKI (Selected)
  - Pros: ✅ Internal PKI, ✅ Automatic rotation, ✅ Fine-grained policies, ✅ Audit logging
  - Cons: ⚠️ Additional dependency
  - Verdict: Best fit for internal service mTLS
- SPIFFE/SPIRE (Rejected)
  - Pros: Industry standard for workload identity (SPIFFE)
  - Cons: ⚠️ More complex setup, ⚠️ Additional infrastructure
  - Verdict: Good, but Vault PKI is simpler for Prism's needs
Architecture Decisions
1. Certificate Lifetimes (7 days)
- Short enough for security, long enough to reduce rotation overhead
- Renewal happens at 50% of lifetime (about 3.5 days before expiry)
- Grace period for network issues
2. ServiceIdentity Extraction from Cert
- Use SANs for service name, namespace, cluster
- URI SANs for SPIFFE-style addressing
- Avoids token passing by relying on certificate verification
3. Proxy-Centric Rotation
- Proxy manages certs, not individual services
- Services connect via proxy, don't need to handle rotation
- Centralized cert management
4. Vault as Root CA (No Intermediate)
- Direct Vault PKI integration
- Simpler than CA chain for initial implementation
- Can add intermediate CA later if needed
Consequences
Positive
- Automatic Rotation: Certificates are reissued automatically before their 7-day maximum lifetime expires
- Auditable: All cert requests logged in Vault
- Secure: mTLS provides transport-level security
- Scalable: Vault issues certs on demand through its API; achievable throughput depends on key type and storage backend and should be load-tested
- Identity Binding: Certificate contains service identity in SANs
- Fine-grained AuthZ: Can use cert metadata for per-service policies
Negative
- Additional Dependency: Must run Vault in production
- Latency: Cert generation adds ~100ms to first request
- Complexity: More moving parts to debug
- Single Point of Failure: Vault PKI must be available
Mitigations
- Cache certificates in proxy
- Health checks for Vault connectivity
- Fallback to cached certs if Vault unavailable
- Graceful degradation with warning logs
Implementation Notes
Migration Steps
Phase 1: Setup Vault PKI (Week 1)
- Enable PKI secrets engine
- Configure root CA
- Create service roles
- Test basic cert generation
Phase 2: Proxy Integration (Week 2)
- Implement mTLS server in proxy
- Add cert rotation logic
- Implement identity extraction
- Add integration tests
Phase 3: Backend Integration (Week 3)
- Update backend drivers to accept mTLS
- Implement client cert verification
- Add test coverage
- Performance tuning
Phase 4: Production Rollout (Week 4)
- Deploy to staging
- Monitor cert rotation
- Gradual rollout
- Decommission old auth methods
Key Gotchas
- Timestamp Skew: Ensure all hosts synchronize time via NTP; a skewed clock can make a freshly issued cert look not yet valid
- DNS Resolution: Services must resolve Vault PKI domain
- Lease Management: Certificates should be revoked on shutdown; note that Vault only creates leases for issued certs when the role sets generate_lease=true
- Certificate Chain: Must verify full chain, not just leaf cert
Testing Strategy
// tests/integration/mtls_test.go
func TestMTLS_CertRotation(t *testing.T) {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
	defer cancel()
	// Start Vault with PKI
	vault := startVault(ctx, t)
	defer vault.Stop(ctx)
	// Start proxy with mTLS. CertTTL is an assumed test-only setting that
	// shortens the cert lifetime so rotation is observable within the
	// timeout; sleeping for real days would outlive the 10-minute context.
	proxy := startProxy(ctx, t, proxyConfig{
		MTLSEnabled: true,
		VaultAddr:   vault.Address(),
		CertTTL:     20 * time.Second,
	})
	defer proxy.Stop()
	// The serving cert must be replaced before the first one expires
	initialCert := proxy.GetCurrentCert()
	require.Eventually(t, func() bool {
		return !assert.ObjectsAreEqual(initialCert, proxy.GetCurrentCert())
	}, 20*time.Second, time.Second, "certificate was not rotated before expiry")
}
Related Documents
- ADR-050: Topaz Policy Engine - Authorization policies
- RFC-063: Proxy Authn/Authz - Gateway authentication
- RFC-064: SAML Federation - Federation authentication
- RFC-065: SCIM Provisioning - User provisioning
Revision History
- 2026-04-19: Initial draft - Vault PKI mTLS architecture