MEMO-085: Vault Troubleshooting Guide

Executive Summary

This guide provides troubleshooting procedures for common Vault integration issues in Prism. It covers authentication failures, credential generation problems, lease management issues, and operational errors.

Troubleshooting Methodology

1. Check Vault Status

# Check Vault server status
vault status

# Expected healthy output:
# Sealed: false
# Cluster Name: vault-cluster
# HA Enabled: true
# HA Mode: active
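
For scripted checks, the exit code of vault status can be used instead of reading the output: the CLI exits with 0 when unsealed, 2 when sealed, and 1 on error. The sketch below maps that code to a label. The function names are illustrative and not part of Prism or Vault.

```shell
# Translate a `vault status` exit code into a short label.
# 0 = unsealed, 2 = sealed, anything else = error.
describe_status_code() {
  case "$1" in
    0) echo "unsealed" ;;
    2) echo "sealed" ;;
    *) echo "error" ;;
  esac
}

# Hypothetical wrapper: run `vault status` quietly and report the result.
check_vault_status() (
  rc=0
  vault status > /dev/null 2>&1 || rc=$?
  describe_status_code "$rc"
)
```

Calling check_vault_status in a monitoring script prints one of the three labels without parsing the table output.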

2. Check Vault Logs

# Tail Vault logs
tail -f /vault/logs/vault.log

# Or with journald
journalctl -u vault -f

# Or with Kubernetes
kubectl logs -n vault vault-0 -f

3. Enable Debug Logging

# Stream debug-level logs from a running server (does not change the configured level)
vault monitor -log-level=debug

# Or set the log level in the server config
log_level = "debug"

Common Issues

Issue 1: JWT Authentication Failure

Symptoms:

Error: authentication failed: failed to validate JWT: invalid token

Diagnosis:

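# 1. Decode the JWT payload (JWTs use unpadded base64url, so convert and pad before decoding)
payload=$(printf '%s' "$JWT_TOKEN" | cut -d'.' -f2 | tr '_-' '/+')
while [ $(( ${#payload} % 4 )) -ne 0 ]; do payload="${payload}="; done
printf '%s' "$payload" | base64 -d | jq .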

# Check claims:
# - aud: must be "prism-patterns"
# - exp: must be in future
# - iss: must match OIDC issuer

# 2. Verify JWT role configuration
vault read auth/jwt/role/prism-patterns

# 3. Test OIDC discovery
curl https://dex.prism.local:5556/dex/.well-known/openid-configuration

# 4. Check JWT auth config
vault read auth/jwt/config
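
The exp check from the claims list above can be automated. The following is a minimal sketch that assumes GNU coreutils (base64 -d) and a single-line JSON payload. It extracts exp with sed rather than a JSON parser, so treat it as a quick diagnostic rather than a validator. Both function names are hypothetical.

```shell
# Decode one base64url segment; JWT segments omit the '=' padding.
b64url_decode() (
  seg=$(printf '%s' "$1" | tr '_-' '/+')
  while [ $(( ${#seg} % 4 )) -ne 0 ]; do
    seg="${seg}="
  done
  printf '%s' "$seg" | base64 -d
)

# Print the seconds left until the token's exp claim (negative once expired).
# $1 = JWT, $2 = current time in epoch seconds (defaults to now).
jwt_seconds_left() (
  payload=$(b64url_decode "$(printf '%s' "$1" | cut -d. -f2)")
  exp=$(printf '%s' "$payload" | sed -n 's/.*"exp"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p')
  [ -n "$exp" ] || exit 1
  now=${2:-$(date +%s)}
  echo $(( exp - now ))
)
```

Running jwt_seconds_left "$JWT_TOKEN" prints a negative number for an expired token, which points to the exp claim rather than the Vault role configuration.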

Resolution:

# Fix incorrect OIDC issuer
vault write auth/jwt/config \
oidc_discovery_url="https://correct-dex-url/dex" \
default_role="prism-patterns"

# Fix incorrect audience in role
vault write auth/jwt/role/prism-patterns \
bound_audiences="prism-patterns" \
# ... other params

Issue 2: Credential Generation Failure

Symptoms:

Error: failed to fetch credentials: * permission denied

Diagnosis:

# 1. Check Vault token policies
vault token lookup

# 2. Verify database connection
vault read database/config/redis

# 3. Test database connectivity
redis-cli -h redis.prism.internal PING

# 4. Check database role
vault read database/roles/redis-role

Resolution:

# Fix missing policy permission
# (vault policy write replaces the whole policy, so include any paths it already grants)
vault policy write prism-patterns-policy - <<EOF
path "database/creds/redis-role" {
  capabilities = ["read"]
}
EOF

# Fix broken database connection
vault write database/config/redis \
plugin_name="redis-database-plugin" \
host="redis.prism.internal" \
port=6379 \
username="vault-admin" \
password="new-password" \
allowed_roles="redis-role"

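# Writing the config above already verifies the connection.
# Optionally rotate the root password so that only Vault knows it
# (the password set above stops working afterwards)
vault write -force database/rotate-root/redis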

Issue 3: Lease Renewal Failure

Symptoms:

2025/11/17 15:58:53 ERROR: Failed to renew lease: permission denied

Diagnosis:

# 1. Check lease status
vault lease lookup database/creds/redis-role/abc123

# 2. Verify token has renewal permission
vault token capabilities sys/leases/renew

# 3. Check if lease is renewable
vault lease lookup database/creds/redis-role/abc123 | grep renewable
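
The output of vault token capabilities is a comma-separated list such as "create, read, update", or the single word "deny" or "root". A small helper, sketched below under that assumption, turns the list into a yes/no answer for one capability. The name has_capability is illustrative.

```shell
# Return success if the capability list printed by `vault token capabilities`
# grants the wanted capability.
# $1 = capability list, $2 = capability to look for.
has_capability() (
  caps=$1
  want=$2
  # Normalize "a, b, c" to ",a,b,c," so whole words can be matched
  list=",$(printf '%s' "$caps" | tr -d ' '),"
  case "$list" in
    *,deny,*) exit 1 ;;        # deny overrides everything else
    *,root,*) exit 0 ;;        # root tokens are allowed everything
    *",$want,"*) exit 0 ;;
    *) exit 1 ;;
  esac
)
```

For example, has_capability "$(vault token capabilities sys/leases/renew)" update succeeds only when the current token may renew leases.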

Resolution:

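# Fix missing renewal permission
# (vault policy write replaces the whole policy, so keep the existing credential path)
vault policy write prism-patterns-policy - <<EOF
path "database/creds/redis-role" {
  capabilities = ["read"]
}
path "sys/leases/renew" {
  capabilities = ["update"]
}
EOF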

# If lease expired, generate new credentials
vault read database/creds/redis-role

Issue 4: High Credential TTL Causing Backend Overload

Symptoms:

Redis: ERR max number of clients reached
PostgreSQL: FATAL: too many connections

Diagnosis:

# Count active database users
# Redis
redis-cli ACL LIST | grep -c "v-jwt-"

# PostgreSQL
psql -c "SELECT count(*) FROM pg_stat_activity WHERE usename LIKE 'v-jwt-%';"

# Check Vault lease count
vault list -format=json sys/leases/lookup/database/creds/redis-role | jq '. | length'

Resolution:

# Reduce credential TTL
# (for redis-database-plugin, creation_statements is a JSON array of ACL rules,
# not a full ACL SETUSER command; the plugin removes the user on revocation)
vault write database/roles/redis-role \
  db_name="redis" \
  creation_statements='["~*", "+@all"]' \
  default_ttl="30m" \
  max_ttl="1h"

# Force revoke old leases
# (revokes every credential issued for this role; clients must fetch new ones)
vault lease revoke -prefix database/creds/redis-role

# Increase backend connection limits (Redis)
redis-cli CONFIG SET maxclients 20000

# Increase backend connection limits (PostgreSQL)
# Edit postgresql.conf: max_connections = 500
systemctl restart postgresql
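
To choose a TTL, it helps to estimate how many credentials exist at steady state: roughly the issuance rate multiplied by the TTL. The sketch below does that arithmetic. It assumes every credential lives for its full TTL and is never revoked early, so it gives an upper bound. The function name is hypothetical.

```shell
# Estimate steady-state active credentials: issued per hour * TTL in hours.
# $1 = credentials issued per hour, $2 = TTL in minutes.
# Integer arithmetic, rounded up.
estimate_active_credentials() {
  echo $(( ($1 * $2 + 59) / 60 ))
}
```

With an illustrative rate of 600 credentials per hour, a 24-hour TTL (1440 minutes) gives 14400 concurrent database users, while the 30-minute TTL above gives 300.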

Issue 5: Vault Token Expired Mid-Session

Symptoms:

Error: failed to renew token: token is expired

Diagnosis:

# Check token status
vault token lookup

# Check token TTL
vault token lookup -format=json | jq -r '.data.ttl'

# Check if token is renewable
vault token lookup -format=json | jq -r '.data.renewable'

Resolution:

# Reauthenticate with JWT
vault write auth/jwt/login role="prism-patterns" jwt="$NEW_JWT_TOKEN"

# Increase token TTL in role config
vault write auth/jwt/role/prism-patterns \
token_ttl="2h" \
token_max_ttl="4h" \
# ... other params

Issue 6: Vault Sealed After Restart

Symptoms:

$ vault status
Sealed: true

Resolution:

# Unseal Vault (requires 3 of 5 unseal keys)
vault operator unseal <key-1>
vault operator unseal <key-2>
vault operator unseal <key-3>

# Verify unsealed
vault status
# Sealed: false

# If using auto-unseal (AWS KMS, etc.), check KMS access
aws kms describe-key --key-id <kms-key-id>
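
While unsealing with key shares, vault status reports an "Unseal Progress" value such as 1/3. The helper below, a small sketch with an illustrative name, turns that value into the number of key shares still required.

```shell
# Given the "Unseal Progress" value from `vault status` (for example "1/3"),
# print how many more key shares are needed.
unseal_keys_remaining() (
  provided=${1%/*}     # part before the slash
  threshold=${1#*/}    # part after the slash
  echo $(( threshold - provided ))
)
```

One way to feed it is unseal_keys_remaining "$(vault status | awk '/Unseal Progress/ {print $3}')", assuming the default table output.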

Issue 7: Database Admin Credentials Incorrect

Symptoms:

Error: failed to create database credentials: authentication failed

Diagnosis:

# Test admin credentials manually (expects PONG)
redis-cli -h redis.prism.internal --user vault-admin --pass admin-password PING

# Or PostgreSQL
psql -h postgres.prism.internal -U vault-admin -d prism

Resolution:

# Update Vault with correct credentials
vault write database/config/redis \
plugin_name="redis-database-plugin" \
host="redis.prism.internal" \
port=6379 \
username="vault-admin" \
password="correct-password" \
allowed_roles="redis-role"

# Test credential generation
vault read database/creds/redis-role

Issue 8: Network Connectivity Issues

Symptoms:

Error: failed to connect to Vault: dial tcp: lookup vault.prism.internal: no such host

Diagnosis:

# Test DNS resolution
nslookup vault.prism.internal
dig vault.prism.internal

# Test TCP connectivity
nc -zv vault.prism.internal 8200
telnet vault.prism.internal 8200

# Test TLS handshake
openssl s_client -connect vault.prism.internal:8200

# Verify network policies (Kubernetes)
kubectl get networkpolicies -n prism-prod
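
The DNS and TCP checks above can be combined into one function that reports the first failing stage. This is a sketch that assumes getent, timeout, and bash are available on the host. The function name is hypothetical, and it does not cover the TLS or NetworkPolicy checks.

```shell
# Report the first failing stage for a host/port: "dns", "tcp", or "ok".
# $1 = host name, $2 = port.
diagnose_endpoint() (
  host=$1
  port=$2
  if ! getent hosts "$host" > /dev/null 2>&1; then
    # Name resolution failed (system resolver, including /etc/hosts)
    echo "dns"
  elif ! timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" > /dev/null 2>&1; then
    # Name resolved, but no TCP connection within 3 seconds
    echo "tcp"
  else
    echo "ok"
  fi
)
```

For example, diagnose_endpoint vault.prism.internal 8200 prints dns for the error shown in the symptoms, and tcp when the name resolves but a firewall or NetworkPolicy blocks the port.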

Resolution:

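# Temporary workaround for DNS resolution: pin the address in /etc/hosts (requires root)
# Fix the DNS record itself for a permanent solution
echo "10.0.1.50 vault.prism.internal" | sudo tee -a /etc/hosts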

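# Fix Kubernetes NetworkPolicy
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-vault-access
  namespace: prism-prod
spec:
  podSelector:
    matchLabels:
      app: keyvalue-runner
  # Restrict this policy to egress so it does not also isolate ingress
  policyTypes:
    - Egress
  egress:
    - to:
        # Vault runs in the "vault" namespace, so select that namespace as well
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: vault
          podSelector:
            matchLabels:
              app: vault
      ports:
        - protocol: TCP
          port: 8200
EOF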

Issue 9: TLS Certificate Verification Failure

Symptoms:

Error: x509: certificate signed by unknown authority

Diagnosis:

# Check certificate chain
openssl s_client -connect vault.prism.internal:8200 -showcerts

# Verify CA certificate
openssl x509 -in /etc/prism/vault-ca.pem -text -noout

# Check certificate expiration
openssl x509 -in /vault/tls/vault.crt -enddate -noout

Resolution:

# Add CA certificate to system trust store
# Ubuntu/Debian
cp /etc/prism/vault-ca.pem /usr/local/share/ca-certificates/vault-ca.crt
update-ca-certificates

# RHEL/CentOS
cp /etc/prism/vault-ca.pem /etc/pki/ca-trust/source/anchors/
update-ca-trust

# Or configure pattern plugin to use CA certificate
# config.yaml
auth:
  vault:
    tls:
      ca_cert: /etc/prism/vault-ca.pem
      skip_verify: false

Issue 10: Pattern Plugin Session Manager Failures

Symptoms:

Error: failed to create session: token validation failed

Diagnosis:

Check pattern plugin logs:

# Kubernetes
kubectl logs -n prism-prod keyvalue-runner-abc123 -f

# Docker
docker logs keyvalue-runner -f

# Look for error patterns:
# - "token validation failed" → JWT/OIDC issue
# - "vault authentication failed" → Vault connection issue
# - "failed to fetch credentials" → Database secrets issue
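
The mapping from error pattern to likely cause can be applied to log lines mechanically. The sketch below encodes the three patterns listed above. The function name is illustrative, and any line that matches none of them is reported as unclassified.

```shell
# Map a pattern plugin log line to its likely cause.
# $1 = one log line.
classify_plugin_error() {
  case "$1" in
    *"token validation failed"*)     echo "JWT/OIDC issue" ;;
    *"vault authentication failed"*) echo "Vault connection issue" ;;
    *"failed to fetch credentials"*) echo "Database secrets issue" ;;
    *)                               echo "unclassified" ;;
  esac
}
```

To use it on live logs, pipe kubectl logs into a while IFS= read -r line loop and call classify_plugin_error "$line" for each line.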

Resolution:

# Verify pattern plugin configuration
kubectl exec -n prism-prod keyvalue-runner-abc123 -- cat /etc/prism/config.yaml

# Check required environment variables
kubectl exec -n prism-prod keyvalue-runner-abc123 -- env | grep VAULT

# Restart pattern plugin
kubectl rollout restart deployment/keyvalue-runner -n prism-prod

Debugging Tools

Vault Audit Log Analysis

# Find authentication failures (audit entries record the request path as "auth/jwt/login")
grep 'auth/jwt/login' /vault/logs/audit.log | grep '"error":"[^"]'

# Find credential generation
grep "database.creds" /vault/logs/audit.log

# Find lease operations
grep "sys.leases" /vault/logs/audit.log

# Analyze by user
grep "alice@example.com" /vault/logs/audit.log
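
For a quick overview, the searches above can be folded into a per-path error count. This sketch assumes the audit device writes one compact JSON object per line, that the JWT auth method is mounted at auth/jwt, and that failed requests carry a non-empty "error" field. The function name is hypothetical.

```shell
# Print "<path> <count>" for audit entries that carry a non-empty error.
# $1 = audit log file.
audit_error_summary() (
  for path in auth/jwt/login database/creds sys/leases; do
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true"
    count=$(grep "$path" "$1" | grep -c '"error":"[^"]' || true)
    printf '%s %s\n' "$path" "$count"
  done
)
```

Running audit_error_summary /vault/logs/audit.log shows at a glance whether failures cluster around login, credential generation, or lease operations.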

Prometheus Queries

# Audit device failure rate (this metric counts failed audit log writes, not failed logins)
rate(vault_audit_log_request_failure[5m])

# Credential generation latency
histogram_quantile(0.99, rate(vault_database_secrets_creation_duration_bucket[5m]))

# Active lease count
vault_expire_num_leases

# Memory usage
vault_runtime_alloc_bytes

Performance Testing

# Test JWT authentication latency
time vault write auth/jwt/login role="prism-patterns" jwt="$JWT_TOKEN"

# Test credential generation latency
time vault read database/creds/redis-role

# Stress test with multiple concurrent requests
for i in {1..100}; do
vault read database/creds/redis-role &
done
wait
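
The loop above waits for all requests but does not report how many failed. A variant that counts failures is sketched below. It is generic and runs any command, and the name run_concurrent is illustrative.

```shell
# Run a command N times in parallel and print how many runs failed.
# $1 = number of parallel runs, remaining arguments = the command to run.
run_concurrent() (
  n=$1
  shift
  pids=""
  i=0
  while [ "$i" -lt "$n" ]; do
    "$@" > /dev/null 2>&1 &
    pids="$pids $!"
    i=$((i + 1))
  done
  failures=0
  for pid in $pids; do
    # wait returns the exit status of the given background job
    wait "$pid" || failures=$((failures + 1))
  done
  echo "$failures"
)
```

For example, run_concurrent 100 vault read database/creds/redis-role prints the number of failed requests. Each successful request creates a real credential, so revoke the test leases afterwards.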

Escalation Procedures

Level 1: Pattern Plugin Team

  • JWT token validation issues
  • Configuration errors
  • Application-level credential usage

Level 2: Platform Team

  • Vault connectivity issues
  • Database backend configuration
  • Network policies

Level 3: Security Team

  • Vault policy issues
  • TLS certificate problems
  • Root token / unseal key access

Level 4: HashiCorp Support

  • Vault bugs
  • Performance degradation
  • Data corruption

Support Information

Log Collection

# Collect Vault logs
vault-support-bundle.sh

# Or manually
tar -czf vault-logs.tar.gz \
/vault/logs/vault.log \
/vault/logs/audit.log \
/vault/config/vault-config.hcl

Health Check Script

#!/bin/bash
# vault-health-check.sh

echo "=== Vault Health Check ==="

echo "1. Vault Status:"
vault status

echo "2. Vault Leader:"
vault operator raft list-peers

echo "3. Active Leases:"
vault list -format=json sys/leases/lookup/database/creds/redis-role | jq '. | length'

echo "4. Memory Usage:"
vault read sys/metrics

echo "5. Recent Errors:"
grep ERROR /vault/logs/vault.log | tail -20

Revision History

  • 2025-11-17: Initial troubleshooting guide