MEMO-085: Vault Troubleshooting Guide
Executive Summary
This guide provides troubleshooting procedures for common Vault integration issues in Prism. It covers authentication failures, credential generation problems, lease management issues, and operational errors.
Troubleshooting Methodology
1. Check Vault Status
# Check Vault server status
vault status
# Expected healthy output:
# Sealed: false
# Cluster Name: vault-cluster
# HA Enabled: true
# HA Mode: active
2. Check Vault Logs
# Tail Vault logs
tail -f /vault/logs/vault.log
# Or with journald
journalctl -u vault -f
# Or with Kubernetes
kubectl logs -n vault vault-0 -f
3. Enable Debug Logging
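# Temporarily stream server logs at debug level (no restart needed)
vault monitor -log-level=debug
# Or set log level in config
log_level = "debug"
# Note: "vault secrets tune -audit-non-hmac-response-keys=..." only controls which
# response keys appear unhashed in the audit log; it does not enable debug logging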
Common Issues
Issue 1: JWT Authentication Failure
Symptoms:
Error: authentication failed: failed to validate JWT: invalid token
Diagnosis:
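# 1. Verify JWT token structure
# (JWT segments are unpadded base64url, so translate the alphabet before decoding;
# some base64 builds still warn about the missing padding)
echo "$JWT_TOKEN" | cut -d'.' -f2 | tr '_-' '/+' | base64 -d 2>/dev/null | jq .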
# Check claims:
# - aud: must be "prism-patterns"
# - exp: must be in future
# - iss: must match OIDC issuer
# 2. Verify JWT role configuration
vault read auth/jwt/role/prism-patterns
# 3. Test OIDC discovery
curl https://dex.prism.local:5556/dex/.well-known/openid-configuration
# 4. Check JWT auth config
vault read auth/jwt/config
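Decoding the payload by hand is fragile because JWT encoding strips the `=` padding that `base64 -d` expects. The following is a minimal sketch of a more tolerant decoder, assuming GNU coreutils `base64`; the function name `jwt_payload` is hypothetical and not part of Prism or Vault.

```shell
#!/usr/bin/env bash
# jwt_payload (hypothetical helper): print the decoded payload of a JWT.
jwt_payload() {
  local seg
  # Take the second dot-separated segment and map base64url to standard base64
  seg=$(printf '%s' "$1" | cut -d'.' -f2 | tr '_-' '/+')
  # Restore the '=' padding that JWT encoding strips
  while [ $(( ${#seg} % 4 )) -ne 0 ]; do
    seg="${seg}="
  done
  printf '%s' "$seg" | base64 -d
}
```

Typical use would be `jwt_payload "$JWT_TOKEN" | jq '{aud, exp, iss}'` to view only the claims listed in the checklist.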
Resolution:
# Fix incorrect OIDC issuer
vault write auth/jwt/config \
oidc_discovery_url="https://correct-dex-url/dex" \
default_role="prism-patterns"
# Fix incorrect audience in role
vault write auth/jwt/role/prism-patterns \
bound_audiences="prism-patterns" \
# ... other params
Issue 2: Credential Generation Failure
Symptoms:
Error: failed to fetch credentials: * permission denied
Diagnosis:
# 1. Check Vault token policies
vault token lookup
# 2. Verify database connection
vault read database/config/redis
# 3. Test database connectivity
redis-cli -h redis.prism.internal PING
# 4. Check database role
vault read database/roles/redis-role
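A "permission denied" error usually comes down to whether the token holds a specific capability on a specific path. The sketch below checks the text printed by `vault token capabilities <path>`, assuming that output is a comma-separated list such as `create, read, update`; the helper name `has_capability` is hypothetical.

```shell
#!/usr/bin/env bash
# has_capability (hypothetical helper): succeed if the capability list
# contains the wanted capability, or "root", which implies everything.
# Assumes the list is formatted as "cap1, cap2, cap3".
has_capability() {
  local wanted="$1" caps="$2"
  case ", ${caps}, " in
    *", ${wanted}, "* | *", root, "*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `has_capability read "$(vault token capabilities database/creds/redis-role)" || echo "policy is missing read"` points directly at the policy fix when the check fails.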
Resolution:
# Fix missing policy permission
# (vault policy write replaces the whole policy, so include every rule the role needs)
vault policy write prism-patterns-policy - <<EOF
path "database/creds/redis-role" {
  capabilities = ["read"]
}
EOF
# Fix broken database connection
vault write database/config/redis \
plugin_name="redis-database-plugin" \
host="redis.prism.internal" \
port=6379 \
username="vault-admin" \
password="new-password" \
allowed_roles="redis-role"
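# Test connection by generating a credential
vault read database/creds/redis-role
# Optional: rotate the root credential so that only Vault knows it.
# This changes the vault-admin password in Redis and it cannot be read back.
vault write -force database/rotate-root/redis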
Issue 3: Lease Renewal Failure
Symptoms:
2025/11/17 15:58:53 ERROR: Failed to renew lease: permission denied
Diagnosis:
# 1. Check lease status
vault lease lookup database/creds/redis-role/abc123
# 2. Verify token has renewal permission
vault token capabilities sys/leases/renew
# 3. Check if lease is renewable
vault lease lookup database/creds/redis-role/abc123 | grep renewable
Resolution:
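# Fix missing renewal permission
# (vault policy write replaces the whole policy, so keep the existing credential rule)
vault policy write prism-patterns-policy - <<EOF
path "database/creds/redis-role" {
  capabilities = ["read"]
}
path "sys/leases/renew" {
  capabilities = ["update"]
}
EOF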
# If lease expired, generate new credentials
vault read database/creds/redis-role
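The choice between renewing a lease and requesting new credentials depends on two fields returned by `vault lease lookup -format=json`: `.data.renewable` and `.data.ttl`. A minimal sketch of that decision, with the hypothetical helper name `lease_action`:

```shell
#!/usr/bin/env bash
# lease_action (hypothetical helper): decide what to do with a lease.
#   $1 = value of .data.renewable ("true" or "false")
#   $2 = value of .data.ttl (remaining seconds)
lease_action() {
  local renewable="$1" ttl="$2"
  if [ "$ttl" -le 0 ]; then
    echo "reissue"        # lease already expired: fetch new credentials
  elif [ "$renewable" = "true" ]; then
    echo "renew"          # still valid and renewable
  else
    echo "reissue"        # valid but not renewable: plan for replacement
  fi
}
```

When the result is `renew`, the follow-up would be `vault lease renew <lease-id>`; when it is `reissue`, `vault read database/creds/redis-role`.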
Issue 4: High Credential TTL Causing Backend Overload
Symptoms:
Redis: ERR max number of clients reached
PostgreSQL: FATAL: too many connections
Diagnosis:
# Count active database users
# Redis
redis-cli ACL LIST | grep -c "v-jwt-"
# PostgreSQL
psql -c "SELECT count(*) FROM pg_stat_activity WHERE usename LIKE 'v-jwt-%';"
# Check Vault lease count
vault list -format=json sys/leases/lookup/database/creds/redis-role | jq '. | length'
Resolution:
# Reduce credential TTL
# (the built-in Redis plugin expects creation_statements as a JSON list of ACL rules
# rather than full ACL SETUSER commands, and removes the user itself on revocation)
vault write database/roles/redis-role \
db_name="redis" \
creation_statements='["~*", "+@all"]' \
default_ttl="30m" \
max_ttl="1h"
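# Force revoke old leases
# (this invalidates every credential issued from the role, so connected clients must fetch new ones)
vault lease revoke -prefix database/creds/redis-role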
# Increase backend connection limits (Redis)
redis-cli CONFIG SET maxclients 20000
# Increase backend connection limits (PostgreSQL)
# Edit postgresql.conf: max_connections = 500
systemctl restart postgresql
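To size the TTL against the backend limit, it helps to estimate how many credentials are alive at once. In steady state this is roughly the issuance rate multiplied by the TTL, assuming credentials are not revoked early. A small sketch of that arithmetic; the function name and the example figures are illustrative assumptions, not measured Prism values:

```shell
#!/usr/bin/env bash
# estimate_live_creds (hypothetical helper): rough steady-state count of
# live dynamic credentials.
#   $1 = credentials issued per minute
#   $2 = credential TTL in minutes
estimate_live_creds() {
  local per_minute="$1" ttl_minutes="$2"
  echo $(( per_minute * ttl_minutes ))
}

# Illustrative numbers: 10 new credentials per minute at a 30 minute TTL
estimate_live_creds 10 30
```

If the estimate approaches the backend's `maxclients` or `max_connections`, a shorter TTL or a lower issuance rate is likely needed before raising limits.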
Issue 5: Vault Token Expired Mid-Session
Symptoms:
Error: failed to renew token: token is expired
Diagnosis:
# Check token status
vault token lookup
# Check token TTL
vault token lookup -format=json | jq -r '.data.ttl'
# Check if token is renewable
vault token lookup -format=json | jq -r '.data.renewable'
Resolution:
# Reauthenticate with JWT
vault write auth/jwt/login role="prism-patterns" jwt="$NEW_JWT_TOKEN"
# Increase token TTL in role config
vault write auth/jwt/role/prism-patterns \
token_ttl="2h" \
token_max_ttl="4h" \
# ... other params
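Expired tokens are easier to avoid than to recover from, so a client can renew once a fixed fraction of the token's lifetime remains. The sketch below uses one third of the original TTL as the threshold, which is an arbitrary assumption rather than a Vault or Prism default; it takes `.data.ttl` and `.data.creation_ttl` from `vault token lookup -format=json`. The name `token_needs_renewal` is hypothetical.

```shell
#!/usr/bin/env bash
# token_needs_renewal (hypothetical helper): succeed when less than one
# third of the token's original lifetime remains.
#   $1 = remaining TTL in seconds (.data.ttl)
#   $2 = TTL at creation in seconds (.data.creation_ttl)
token_needs_renewal() {
  local ttl="$1" creation_ttl="$2"
  [ $(( ttl * 3 )) -lt "$creation_ttl" ]
}
```

A caller might run `token_needs_renewal "$ttl" "$creation_ttl" && vault token renew`, falling back to a fresh JWT login if the renewal itself fails.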
Issue 6: Vault Sealed After Restart
Symptoms:
$ vault status
Sealed: true
Resolution:
# Unseal Vault (requires 3 of 5 unseal keys)
vault operator unseal <key-1>
vault operator unseal <key-2>
vault operator unseal <key-3>
# Verify unsealed
vault status
# Sealed: false
# If using auto-unseal (AWS KMS, etc.), check KMS access
aws kms describe-key --key-id <kms-key-id>
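For scripted checks, the exit code of `vault status` is more reliable than parsing its output: 0 means unsealed, 2 means sealed, and 1 means the command failed. A minimal sketch that turns the code into a label; the function name `seal_state_from_exit_code` is hypothetical:

```shell
#!/usr/bin/env bash
# seal_state_from_exit_code (hypothetical helper): map the exit code of
# "vault status" to a human-readable state.
seal_state_from_exit_code() {
  case "$1" in
    0) echo "unsealed" ;;
    2) echo "sealed" ;;
    *) echo "error" ;;   # 1 or anything unexpected: Vault unreachable or CLI error
  esac
}
```

Usage could look like `vault status >/dev/null 2>&1; seal_state_from_exit_code $?`.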
Issue 7: Database Admin Credentials Incorrect
Symptoms:
Error: failed to create database credentials: authentication failed
Diagnosis:
# Test admin credentials manually (enter the AUTH command at the redis-cli prompt)
redis-cli -h redis.prism.internal
AUTH vault-admin admin-password
# Or PostgreSQL
psql -h postgres.prism.internal -U vault-admin -d prism
Resolution:
# Update Vault with correct credentials
vault write database/config/redis \
plugin_name="redis-database-plugin" \
host="redis.prism.internal" \
port=6379 \
username="vault-admin" \
password="correct-password" \
allowed_roles="redis-role"
# Test credential generation
vault read database/creds/redis-role
Issue 8: Network Connectivity Issues
Symptoms:
Error: failed to connect to Vault: dial tcp: lookup vault.prism.internal: no such host
Diagnosis:
# Test DNS resolution
nslookup vault.prism.internal
dig vault.prism.internal
# Test TCP connectivity
nc -zv vault.prism.internal 8200
telnet vault.prism.internal 8200
# Test TLS handshake
openssl s_client -connect vault.prism.internal:8200
# Verify network policies (Kubernetes)
kubectl get networkpolicies -n prism-prod
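The DNS and TCP checks need the host and port that the client is actually configured with, which normally come from `VAULT_ADDR`. A small sketch that splits such a URL, assuming a plain hostname or IPv4 address (IPv6 literals are not handled); the function name `vault_addr_host_port` is hypothetical:

```shell
#!/usr/bin/env bash
# vault_addr_host_port (hypothetical helper): print "host port" for a
# Vault address such as https://vault.prism.internal:8200.
vault_addr_host_port() {
  local url="$1" default_port=443
  case "$url" in
    http://*) default_port=80 ;;
  esac
  local hostport="${url#*://}"   # drop the scheme, if any
  hostport="${hostport%%/*}"     # drop any path
  local host="${hostport%%:*}"
  local port="$default_port"
  case "$hostport" in
    *:*) port="${hostport##*:}" ;;   # explicit port wins over the default
  esac
  printf '%s %s\n' "$host" "$port"
}
```

The result can feed the checks directly, for example `read -r host port < <(vault_addr_host_port "$VAULT_ADDR"); nc -zv "$host" "$port"`.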
Resolution:
# Fix DNS resolution
# Temporary workaround: pin the address in /etc/hosts until the DNS record is restored
echo "10.0.1.50 vault.prism.internal" | sudo tee -a /etc/hosts
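# Fix Kubernetes NetworkPolicy
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-vault-access
  namespace: prism-prod
spec:
  podSelector:
    matchLabels:
      app: keyvalue-runner
  # Limit the policy to egress so it does not also block ingress to the runner
  policyTypes:
  - Egress
  egress:
  - to:
    # Vault runs in the "vault" namespace, so select that namespace explicitly
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: vault
      podSelector:
        matchLabels:
          app: vault
    ports:
    - protocol: TCP
      port: 8200
EOF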
Issue 9: TLS Certificate Verification Failure
Symptoms:
Error: x509: certificate signed by unknown authority
Diagnosis:
# Check certificate chain
openssl s_client -connect vault.prism.internal:8200 -showcerts
# Verify CA certificate
openssl x509 -in /etc/prism/vault-ca.pem -text -noout
# Check certificate expiration
openssl x509 -in /vault/tls/vault.crt -enddate -noout
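The `-enddate` output is a date string, which is awkward to compare by eye. The sketch below converts it to days remaining, assuming GNU `date` (for `-d`) and the `notAfter=Mon DD HH:MM:SS YYYY GMT` format that `openssl x509 -enddate` prints; the function name `cert_days_left` is hypothetical.

```shell
#!/usr/bin/env bash
# cert_days_left (hypothetical helper): whole days until a certificate expires.
#   $1 = output of "openssl x509 -enddate -noout", e.g. "notAfter=Nov 17 15:58:53 2026 GMT"
#   $2 = optional current time as epoch seconds (defaults to now)
cert_days_left() {
  local not_after="${1#notAfter=}"
  local now_epoch="${2:-$(date +%s)}"
  local end_epoch
  end_epoch=$(date -d "$not_after" +%s) || return 1   # GNU date parsing
  echo $(( (end_epoch - now_epoch) / 86400 ))
}
```

For example, `cert_days_left "$(openssl x509 -in /vault/tls/vault.crt -enddate -noout)"` prints a negative number once the certificate has been expired for at least a day.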
Resolution:
# Add CA certificate to system trust store
# Ubuntu/Debian
cp /etc/prism/vault-ca.pem /usr/local/share/ca-certificates/vault-ca.crt
update-ca-certificates
# RHEL/CentOS
cp /etc/prism/vault-ca.pem /etc/pki/ca-trust/source/anchors/
update-ca-trust
# Or configure pattern plugin to use CA certificate
# config.yaml
auth:
  vault:
    tls:
      ca_cert: /etc/prism/vault-ca.pem
      skip_verify: false
Issue 10: Pattern Plugin Session Manager Failures
Symptoms:
Error: failed to create session: token validation failed
Diagnosis:
Check pattern plugin logs:
# Kubernetes
kubectl logs -n prism-prod keyvalue-runner-abc123 -f
# Docker
docker logs keyvalue-runner -f
# Look for error patterns:
# - "token validation failed" → JWT/OIDC issue
# - "vault authentication failed" → Vault connection issue
# - "failed to fetch credentials" → Database secrets issue
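The three error patterns map cleanly onto a lookup, which is useful when triaging many log lines at once. A minimal sketch of that mapping; the function name `classify_error` is hypothetical and the categories are exactly those listed above:

```shell
#!/usr/bin/env bash
# classify_error (hypothetical helper): map a pattern plugin log line
# to the likely problem area.
classify_error() {
  case "$1" in
    *"token validation failed"*)     echo "JWT/OIDC issue" ;;
    *"vault authentication failed"*) echo "Vault connection issue" ;;
    *"failed to fetch credentials"*) echo "Database secrets issue" ;;
    *)                               echo "unclassified" ;;
  esac
}
```

Piping logs through it, for example `kubectl logs ... | while IFS= read -r line; do classify_error "$line"; done | sort | uniq -c`, gives a quick count per category.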
Resolution:
# Verify pattern plugin configuration
kubectl exec -n prism-prod keyvalue-runner-abc123 -- cat /etc/prism/config.yaml
# Check required environment variables
kubectl exec -n prism-prod keyvalue-runner-abc123 -- env | grep VAULT
# Restart pattern plugin
kubectl rollout restart deployment/keyvalue-runner -n prism-prod
Debugging Tools
Vault Audit Log Analysis
# Find authentication failures (audit entries record the full path, e.g. auth/jwt/login)
grep "auth/jwt/login" /vault/logs/audit.log | grep '"error"'
# Find credential generation
grep "database.creds" /vault/logs/audit.log
# Find lease operations
grep "sys.leases" /vault/logs/audit.log
# Analyze by user
grep "alice@example.com" /vault/logs/audit.log
Prometheus Queries
# Audit request failure rate (this metric counts failed writes to audit devices, which is not the same as failed logins)
rate(vault_audit_log_request_failure{auth_method="jwt"}[5m])
# Credential generation latency
histogram_quantile(0.99, rate(vault_database_secrets_creation_duration_bucket[5m]))
# Active lease count
vault_expire_num_leases
# Memory usage
vault_runtime_alloc_bytes
Performance Testing
# Test JWT authentication latency
time vault write auth/jwt/login role="prism-patterns" jwt="$JWT_TOKEN"
# Test credential generation latency
time vault read database/creds/redis-role
# Stress test with multiple concurrent requests
# (each request creates a real database user and lease, so run this against a test role)
for i in {1..100}; do
  vault read database/creds/redis-role &
done
wait
Escalation Procedures
Level 1: Pattern Plugin Team
- JWT token validation issues
- Configuration errors
- Application-level credential usage
Level 2: Platform Team
- Vault connectivity issues
- Database backend configuration
- Network policies
Level 3: Security Team
- Vault policy issues
- TLS certificate problems
- Root token / unseal key access
Level 4: HashiCorp Support
- Vault bugs
- Performance degradation
- Data corruption
Support Information
Log Collection
# Collect Vault logs
vault-support-bundle.sh
# Or manually
tar -czf vault-logs.tar.gz \
/vault/logs/vault.log \
/vault/logs/audit.log \
/vault/config/vault-config.hcl
Health Check Script
#!/bin/bash
# vault-health-check.sh
echo "=== Vault Health Check ==="
echo "1. Vault Status:"
vault status
echo "2. Vault Leader:"
vault operator raft list-peers
echo "3. Active Leases:"
vault list -format=json sys/leases/lookup/database/creds/redis-role | jq '. | length'
echo "4. Memory Usage:"
vault read sys/metrics
echo "5. Recent Errors:"
grep ERROR /vault/logs/vault.log | tail -20
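As written, the script prints results but always exits successfully, so it cannot drive alerts or CI gates. One possible extension is a small wrapper that counts failed checks; the sketch below is an assumption about how that could look, and the names `run_check` and `failures` are hypothetical.

```shell
#!/usr/bin/env bash
# Number of checks that returned a non-zero exit code
failures=0

# run_check (hypothetical helper): run a command under a label and
# record whether it succeeded.
#   $1 = label, remaining arguments = command to run
run_check() {
  local label="$1"
  shift
  echo "== ${label}"
  if "$@"; then
    echo "OK: ${label}"
  else
    echo "FAILED: ${label}"
    failures=$((failures + 1))
  fi
}
```

Inside the health check this would be used as `run_check "Vault Status" vault status`, ending the script with `exit "$failures"` so that a sealed or unreachable Vault produces a non-zero exit code.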
Related Documents
- MEMO-084: Vault Operator Guide
- RFC-062: Unified Authentication and Session Management
- ADR-007: Authentication and Authorization
- MEMO-083: Phase 2 Vault Integration Implementation Plan
Revision History
- 2025-11-17: Initial troubleshooting guide