Vault in Production: HA, Auto-Unseal, and Disaster Recovery
Production Architecture Overview
A production Vault deployment looks nothing like a dev server. You need high availability so that a single node failure does not take down secret access across your entire infrastructure. You need persistent storage so that secrets survive restarts and node replacements. You need auto-unseal so that nodes can recover without human intervention at 3 AM. You need audit logging so that every secret access is recorded for security investigations and compliance requirements. And you need monitoring so that you know about problems before they become outages.
The cost of getting this wrong is high. If Vault goes down, every system that depends on it for credentials stops working. Database connections fail, API calls return authentication errors, TLS certificates cannot be renewed, and CI/CD pipelines halt. Vault is not just another service in your infrastructure; it is a dependency of nearly every other service.
Here is the target architecture for a robust production deployment:
                      +-------------------+
                      |   Load Balancer   |
                      |   (L4 TCP/TLS)    |
                      +---------+---------+
                                |
               +----------------+----------------+
               |                |                |
        +------+------+  +------+------+  +------+------+
        |   Vault 1   |  |   Vault 2   |  |   Vault 3   |
        |  (Leader)   |  |   (Stby)    |  |   (Stby)    |
        +------+------+  +------+------+  +------+------+
               |                |                |
               +----------------+----------------+
                                |
                        +-------+-------+
                        |     Raft      |
                        |    Storage    |
                        |  (Local SSD)  |
                        +-------+-------+
                                |
                        +-------+-------+
                        |   Cloud KMS   |
                        | (Auto-Unseal) |
                        +-------+-------+
                                |
              +-----------------+-----------------+
              |                 |                 |
       +------+------+  +-------+------+  +-------+------+
       | Audit Log 1 |  | Audit Log 2  |  |  Telemetry   |
       | (Local File)|  |  (Syslog)    |  | (Prometheus) |
       +-------------+  +--------------+  +--------------+
Sizing Recommendations
| Component | Small (dev/staging) | Medium (production) | Large (enterprise) |
|---|---|---|---|
| Nodes | 3 | 5 | 5-7 |
| CPU per node | 2 cores | 4 cores | 8 cores |
| RAM per node | 4 GB | 8 GB | 16 GB |
| Storage per node | 25 GB SSD | 50 GB SSD | 100 GB NVMe |
| Network | Standard | Low latency | Dedicated VLAN |
| Active leases | Up to 10,000 | Up to 100,000 | Up to 500,000 |
Vault is CPU and I/O bound during normal operations. The encryption and decryption of secrets, the Raft consensus protocol, and audit log writes all require consistent low-latency disk I/O. Always use SSDs or NVMe drives for Vault storage and audit logs.
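Before committing a volume to Vault, it is worth a quick sanity check that it can sustain an fsync-heavy write pattern. A rough sketch using GNU coreutils (`fio` gives far more accurate numbers; paths here are illustrative):

```shell
#!/bin/sh
# Rough probe of synchronous write latency on the volume that will hold
# /opt/vault/data. Writes 100 x 4 KB blocks with a sync per block
# (oflag=dsync), similar in spirit to Raft log appends.
TARGET_DIR="${1:-.}"                 # pass the mount point to test, e.g. /opt/vault
PROBE_FILE="$TARGET_DIR/.vault-io-probe"

START=$(date +%s%N)
dd if=/dev/zero of="$PROBE_FILE" bs=4k count=100 oflag=dsync 2>/dev/null
END=$(date +%s%N)
rm -f "$PROBE_FILE"

TOTAL_MS=$(( (END - START) / 1000000 ))
PER_WRITE_MS=$(( TOTAL_MS / 100 ))
echo "100 synchronous 4k writes took ${TOTAL_MS}ms (~${PER_WRITE_MS}ms per write)"
```

On a healthy SSD this lands in the low single-digit milliseconds per write; tens of milliseconds suggests the volume will struggle with Raft commits and audit logging.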
Raft Integrated Storage
Vault supports several storage backends, but Raft integrated storage is now the recommended option for most deployments. It eliminates the need for a separate Consul cluster, reducing operational complexity significantly. With Raft, Vault manages its own distributed consensus, replication, and leader election.
Why Raft Over Consul
| Factor | Raft Integrated Storage | Consul Backend |
|---|---|---|
| Operational complexity | Lower (no separate cluster) | Higher (manage Consul too) |
| Infrastructure cost | Vault nodes only | Vault + Consul nodes (6+ total) |
| Network requirements | Vault-to-Vault only | Vault-to-Consul + Consul-to-Consul |
| Snapshot/backup | Built-in vault operator raft snapshot | Separate Consul snapshot process |
| Performance | Direct local storage | Network hop to Consul on every write |
| HA support | Yes (built-in Raft leader election) | Yes (via Consul session locking) |
| Recommended by HashiCorp | Yes (since Vault 1.4+) | Still supported but not preferred |
| Debugging | Single system to troubleshoot | Two distributed systems to troubleshoot |
Full Raft Configuration
# /etc/vault.d/vault.hcl -- Node 1 (initial leader)
# Raft storage backend
storage "raft" {
  path    = "/opt/vault/data"
  node_id = "vault-1"

  # Performance tuning
  performance_multiplier = 1

  # Autopilot configuration for automatic cluster management
  autopilot_reconcile_interval = "10s"

  # Peer discovery: list all other nodes
  retry_join {
    leader_api_addr         = "https://vault-2.internal:8200"
    leader_ca_cert_file     = "/opt/vault/tls/ca.crt"
    leader_client_cert_file = "/opt/vault/tls/vault.crt"
    leader_client_key_file  = "/opt/vault/tls/vault.key"
  }
  retry_join {
    leader_api_addr         = "https://vault-3.internal:8200"
    leader_ca_cert_file     = "/opt/vault/tls/ca.crt"
    leader_client_cert_file = "/opt/vault/tls/vault.crt"
    leader_client_key_file  = "/opt/vault/tls/vault.key"
  }
}

# API listener with TLS
listener "tcp" {
  address         = "0.0.0.0:8200"
  cluster_address = "0.0.0.0:8201"

  # TLS configuration
  tls_cert_file   = "/opt/vault/tls/vault.crt"
  tls_key_file    = "/opt/vault/tls/vault.key"
  tls_min_version = "tls13"

  # Client certificate verification (for mTLS)
  tls_require_and_verify_client_cert = false
  tls_client_ca_file                 = "/opt/vault/tls/ca.crt"

  # Request handling
  max_request_size     = 33554432 # 32 MB
  max_request_duration = "90s"

  # Telemetry: require a Vault token to read /v1/sys/metrics
  # (the monitoring section creates a dedicated token for Prometheus)
  telemetry {
    unauthenticated_metrics_access = false
  }
}

# Cluster communication addresses
api_addr     = "https://vault-1.internal:8200"
cluster_addr = "https://vault-1.internal:8201"
cluster_name = "vault-prod"

# Web UI
ui = true

# Logging
log_level            = "info"
log_file             = "/var/log/vault/vault.log"
log_rotate_duration  = "24h"
log_rotate_max_files = 30

# Telemetry for monitoring
telemetry {
  prometheus_retention_time = "30s"
  disable_hostname          = true
  usage_gauge_period        = "10m"
  maximum_gauge_cardinality = 500
}

# Disable memory lock only if the cap_ipc_lock capability cannot be set
# disable_mlock = true

# Default and maximum lease TTLs
default_lease_ttl = "1h"
max_lease_ttl     = "768h"
Each node gets the same configuration with its own node_id and api_addr. The retry_join blocks tell each node where to find the other members. TLS is configured for both the API listener and the Raft cluster communication.
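Since the per-node files differ only in the node name embedded in node_id, api_addr, and cluster_addr, it is convenient to render them from one template. A minimal sketch (the template placeholder and /tmp output paths are illustrative):

```shell
#!/bin/sh
# Render a per-node vault.hcl from a shared template by substituting
# the node name: __NODE__ -> vault-1 / vault-2 / vault-3.
cat > /tmp/vault.hcl.tpl <<'EOF'
storage "raft" {
  path    = "/opt/vault/data"
  node_id = "__NODE__"
}
api_addr     = "https://__NODE__.internal:8200"
cluster_addr = "https://__NODE__.internal:8201"
EOF

for node in vault-1 vault-2 vault-3; do
  sed "s/__NODE__/${node}/g" /tmp/vault.hcl.tpl > "/tmp/vault.hcl.${node}"
done

# Each rendered file carries its own identity
grep node_id /tmp/vault.hcl.vault-2
```

The same approach extends to the retry_join blocks if you generate the peer list from your inventory.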
Cluster Initialization and Formation
# Initialize the first node
export VAULT_ADDR='https://vault-1.internal:8200'
export VAULT_CACERT='/opt/vault/tls/ca.crt'
vault operator init -key-shares=5 -key-threshold=3
# Save the output securely (5 unseal keys + 1 root token)
# Unseal the first node
vault operator unseal UNSEAL_KEY_1
vault operator unseal UNSEAL_KEY_2
vault operator unseal UNSEAL_KEY_3
# The other nodes automatically join via retry_join
# They need to be unsealed as well
export VAULT_ADDR='https://vault-2.internal:8200'
vault operator unseal UNSEAL_KEY_1
vault operator unseal UNSEAL_KEY_2
vault operator unseal UNSEAL_KEY_3
export VAULT_ADDR='https://vault-3.internal:8200'
vault operator unseal UNSEAL_KEY_1
vault operator unseal UNSEAL_KEY_2
vault operator unseal UNSEAL_KEY_3
# Verify the Raft cluster
export VAULT_ADDR='https://vault-1.internal:8200'
vault operator raft list-peers
Expected output:
Node Address State Voter
---- ------- ----- -----
vault-1 vault-1.internal:8201 leader true
vault-2 vault-2.internal:8201 follower true
vault-3 vault-3.internal:8201 follower true
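In automation it helps to parse that output and assert the expected membership. A sketch against a captured copy of the list-peers output (column positions are an assumption to verify against your Vault version):

```shell
#!/bin/sh
# In production: vault operator raft list-peers > /tmp/peers.txt
# Here we use a captured sample so the parsing is self-contained.
cat > /tmp/peers.txt <<'EOF'
Node       Address                  State       Voter
----       -------                  -----       -----
vault-1    vault-1.internal:8201    leader      true
vault-2    vault-2.internal:8201    follower    true
vault-3    vault-3.internal:8201    follower    true
EOF

# Count voting members (column 4) and leaders (column 3)
VOTERS=$(awk '$4 == "true"' /tmp/peers.txt | wc -l)
LEADERS=$(awk '$3 == "leader"' /tmp/peers.txt | wc -l)

echo "voters=${VOTERS} leaders=${LEADERS}"
if [ "$VOTERS" -eq 3 ] && [ "$LEADERS" -eq 1 ]; then
  echo "cluster membership OK"
else
  echo "ALERT: unexpected Raft membership"
fi
```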
Raft Autopilot
Vault includes an autopilot feature for Raft that automates cluster management tasks:
# Check autopilot status
vault operator raft autopilot get-config
# Configure autopilot
vault operator raft autopilot set-config \
cleanup-dead-servers=true \
dead-server-last-contact-threshold=24h \
min-quorum=3 \
server-stabilization-time=10s
# View cluster health from autopilot's perspective
vault operator raft autopilot state
Autopilot can automatically remove dead servers from the cluster, promoting new voters when a node has been unreachable for longer than the threshold. This reduces manual intervention for common failure scenarios.
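For scheduled checks, the autopilot state is also available as JSON, which makes unhealthy members easy to flag. A sketch against a captured sample (the field names follow the `-format=json` output; treat the exact shape as an assumption and verify against your Vault version):

```shell
#!/bin/sh
# In production: vault operator raft autopilot state -format=json > state.json
# Here we parse a captured sample so the logic is self-contained.
cat > /tmp/autopilot-state.json <<'EOF'
{
  "healthy": false,
  "failure_tolerance": 0,
  "servers": {
    "vault-1": {"name": "vault-1", "healthy": true,  "status": "leader"},
    "vault-2": {"name": "vault-2", "healthy": true,  "status": "voter"},
    "vault-3": {"name": "vault-3", "healthy": false, "status": "voter"}
  }
}
EOF

# List members autopilot considers unhealthy, and the remaining tolerance
UNHEALTHY=$(jq -r '.servers[] | select(.healthy == false) | .name' /tmp/autopilot-state.json)
TOLERANCE=$(jq -r '.failure_tolerance' /tmp/autopilot-state.json)

echo "unhealthy members: ${UNHEALTHY:-none}"
echo "failure tolerance: ${TOLERANCE}"
[ -z "$UNHEALTHY" ] || echo "ALERT: cluster has unhealthy Raft members"
```

A failure_tolerance of 0 means the next node loss costs quorum, which is worth alerting on even while the cluster is still serving requests.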
Auto-Unseal
Manual unsealing is acceptable for a single development instance, but it is operationally untenable in production. If a node restarts at 2 AM, you do not want to page three operators to provide unseal keys. Auto-unseal delegates the unseal operation to a trusted cloud Key Management Service (KMS). With auto-unseal, the Vault master key is encrypted by the KMS key and stored alongside the encrypted data. When Vault starts, it calls the KMS API to decrypt the master key automatically.
AWS KMS Auto-Unseal
# Add to vault.hcl
seal "awskms" {
  region     = "us-east-1"
  kms_key_id = "arn:aws:kms:us-east-1:123456789012:key/abcd-1234-efgh-5678"

  # Optional: use a specific AWS profile or endpoint
  # access_key = "..." # Prefer IAM roles instead
  # secret_key = "..."
  # endpoint   = "https://kms.us-east-1.amazonaws.com"
}
The Vault server's IAM role needs these permissions on the KMS key:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "kms:Encrypt",
        "kms:Decrypt",
        "kms:DescribeKey"
      ],
      "Resource": "arn:aws:kms:us-east-1:123456789012:key/abcd-1234-efgh-5678"
    }
  ]
}
Best practices for the KMS key:
- Use a dedicated KMS key for Vault unsealing (do not share with other services)
- Enable key rotation on the KMS key (AWS handles this transparently)
- Restrict access to the KMS key to only the Vault IAM role
- Enable CloudTrail logging on the KMS key for audit purposes
- Consider multi-region KMS keys if Vault spans regions
Azure Key Vault Auto-Unseal
seal "azurekeyvault" {
  tenant_id  = "your-tenant-id"
  vault_name = "vault-unseal-keyvault"
  key_name   = "vault-unseal-key"

  # client_id     = "..." # Use managed identity instead
  # client_secret = "..."
  # environment   = "AZUREPUBLICCLOUD"
}
For Azure, use a Managed Identity assigned to the Vault VM or AKS node pool. The identity needs key permissions on the Key Vault, for example via these role assignments:
- Key Vault Crypto User (for the wrap/unwrap key operations used during seal and unseal)
- Key Vault Reader (to read key and vault metadata)
GCP Cloud KMS Auto-Unseal
seal "gcpckms" {
  project    = "my-project"
  region     = "global"
  key_ring   = "vault-keyring"
  crypto_key = "vault-unseal-key"

  # credentials = "/path/to/service-account.json" # Use workload identity instead
}
HashiCorp Cloud Platform (HCP) Transit Auto-Unseal
You can also use another Vault cluster (or HCP Vault) for auto-unseal via the Transit secret engine:
seal "transit" {
  address         = "https://hcp-vault.example.com:8200"
  token           = "hvs.transit-unseal-token"
  disable_renewal = false
  key_name        = "autounseal"
  mount_path      = "transit/"

  # tls_ca_cert = "/path/to/ca.crt"
}
Migrating from Shamir to Auto-Unseal
If you have an existing Vault cluster using Shamir keys and want to migrate to auto-unseal:
# 1. Add the seal stanza to vault.hcl on all nodes
# 2. Stop the Vault service on all nodes
sudo systemctl stop vault
# 3. Start Vault on the leader node first
sudo systemctl start vault
# 4. The node starts in a migration state
# Provide the old Shamir keys with the -migrate flag
vault operator unseal -migrate SHAMIR_KEY_1
vault operator unseal -migrate SHAMIR_KEY_2
vault operator unseal -migrate SHAMIR_KEY_3
# 5. Vault migrates the seal and generates recovery keys
# Save the recovery keys securely
# 6. Start the remaining nodes -- they auto-unseal via KMS
sudo systemctl start vault # on vault-2
sudo systemctl start vault # on vault-3
# 7. Verify the migration
vault status
# Seal Type should show "awskms" (or your chosen KMS)
After migration, Vault uses the KMS key for unsealing. The Shamir keys are replaced by recovery keys. Recovery keys cannot unseal Vault but are needed for certain administrative operations like generating a root token.
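Scripted checks can confirm the migration from the JSON status output. A sketch using a captured `vault status -format=json` sample (the field names match the real output; the values shown are illustrative):

```shell
#!/bin/sh
# In production: vault status -format=json > /tmp/vault-status.json
cat > /tmp/vault-status.json <<'EOF'
{
  "type": "awskms",
  "initialized": true,
  "sealed": false,
  "recovery_seal": true,
  "n": 5,
  "t": 3,
  "version": "1.15.4"
}
EOF

SEAL_TYPE=$(jq -r '.type' /tmp/vault-status.json)
RECOVERY=$(jq -r '.recovery_seal' /tmp/vault-status.json)

if [ "$SEAL_TYPE" = "shamir" ]; then
  echo "still using Shamir seal -- migration has not completed"
else
  echo "seal type: ${SEAL_TYPE} (recovery keys in use: ${RECOVERY})"
fi
```

After a successful migration, recovery_seal is true and n/t describe the recovery key shares rather than unseal keys.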
Audit Devices
Audit logging is non-negotiable in production. Every request and response is logged, including who accessed what secret, when, and from where. Sensitive values in the audit log are HMAC-SHA256 hashed with a salt generated per audit device, so the log itself contains no plaintext secrets. Because the salt is stable for a given device, the same value always produces the same HMAC there, which lets you detect repeated access to the same secret, and the sys/audit-hash endpoint can compute the HMAC of a known value so you can search for it in the log.
Enabling Multiple Audit Devices
Always enable at least two audit devices for redundancy. If all audit devices fail (e.g., disk full for a file audit device), Vault stops responding to all requests. This is a security feature, not a bug. Vault will not serve secrets if it cannot log the access.
# Primary: file-based audit log
vault audit enable -path=file-primary file \
file_path=/var/log/vault/audit.log \
log_raw=false \
hmac_accessor=true \
mode=0600
# Secondary: syslog for centralized log collection
vault audit enable -path=syslog-backup syslog \
tag="vault" \
facility="AUTH"
# Tertiary: socket for real-time log streaming
vault audit enable -path=socket-elk socket \
address="logstash.internal:9200" \
socket_type="tcp"
# Verify all audit devices are enabled
vault audit list -detailed
Audit Log Format and Analysis
Each audit entry is a JSON object containing request and response details:
{
  "time": "2026-03-23T10:15:30.123456Z",
  "type": "response",
  "auth": {
    "client_token": "hmac-sha256:abc123def456...",
    "accessor": "hmac-sha256:789ghi012jkl...",
    "display_name": "kubernetes-production-webapp-sa",
    "policies": ["webapp", "default"],
    "token_policies": ["webapp", "default"],
    "metadata": {
      "role": "webapp",
      "service_account_name": "webapp-sa",
      "service_account_namespace": "production",
      "service_account_uid": "12345-abcde-67890"
    },
    "entity_id": "entity-uuid-here",
    "token_type": "service",
    "token_ttl": 3600,
    "token_issue_time": "2026-03-23T09:15:30Z"
  },
  "request": {
    "id": "request-uuid-here",
    "operation": "read",
    "path": "secret/data/webapp/production",
    "remote_address": "10.0.1.45",
    "remote_port": 49152,
    "namespace": {
      "id": "root"
    },
    "wrap_ttl": 0
  },
  "response": {
    "mount_type": "kv",
    "mount_accessor": "kv_abc123",
    "mount_is_external_plugin": false,
    "mount_running_version": "v0.16.1+builtin"
  }
}
This tells you exactly which service account, from which pod IP, read which secret, at what time, and which policies authorized the access.
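To turn those entries into a compact access report, pull out just the identity, source, and path fields. A sketch over a one-entry sample log (in production, point jq at /var/log/vault/audit.log):

```shell
#!/bin/sh
# Reduce audit entries to "time | who | from | op | path" lines.
cat > /tmp/audit-sample.log <<'EOF'
{"time":"2026-03-23T10:15:30Z","type":"response","auth":{"display_name":"kubernetes-production-webapp-sa","policies":["webapp","default"]},"request":{"operation":"read","path":"secret/data/webapp/production","remote_address":"10.0.1.45"}}
EOF

jq -r 'select(.type == "response")
       | [.time, .auth.display_name, .request.remote_address,
          .request.operation, .request.path]
       | join(" | ")' /tmp/audit-sample.log
```

This prints one pipe-delimited line per completed request, which is easy to feed into sort/uniq for access summaries.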
Audit Log Querying
# Find all accesses to a specific secret path
cat /var/log/vault/audit.log | \
jq 'select(.request.path == "secret/data/webapp/production")'
# Find all denied requests (for debugging policy issues)
cat /var/log/vault/audit.log | \
jq 'select(.error != null and .error != "") | {time: .time, path: .request.path, error: .error}'
# Find all requests from a specific IP
cat /var/log/vault/audit.log | \
jq 'select(.request.remote_address == "10.0.1.45")'
# Count requests per path in the last hour
cat /var/log/vault/audit.log | \
jq -r 'select(.type == "request") | .request.path' | sort | uniq -c | sort -rn | head -20
# Find all root token usage (should be zero in normal operations)
cat /var/log/vault/audit.log | \
jq 'select(.auth.policies | index("root"))'
Audit Log Rotation
# Configure logrotate for Vault audit logs
# /etc/logrotate.d/vault
cat > /etc/logrotate.d/vault <<'LOGROTATE'
/var/log/vault/audit.log {
    daily
    rotate 90
    compress
    delaycompress
    missingok
    notifempty
    create 0600 vault vault
    postrotate
        # Vault reopens its audit log files on SIGHUP
        systemctl kill --signal=HUP vault.service 2>/dev/null || true
    endscript
}
LOGROTATE
For high-volume environments, consider streaming audit logs directly to a centralized logging system (ELK, Splunk, Datadog) rather than writing to local files. This reduces local disk pressure and provides better querying capabilities.
Performance Standby Nodes and Client-Side Caching
In open-source Vault with Raft, standby nodes forward all requests to the leader. This means the leader handles 100% of the request load. To reduce pressure on the leader, use client-side caching through Vault Agent.
Vault Agent Caching
Deploy Vault Agent as a sidecar or DaemonSet that caches responses locally:
# vault-agent-cache.hcl
auto_auth {
  method "kubernetes" {
    mount_path = "auth/kubernetes"
    config = {
      role = "webapp"
    }
  }

  sink "file" {
    config = {
      path = "/home/vault/.vault-token"
    }
  }
}

cache {
  use_auto_auth_token = true

  persist = {
    type = "kubernetes"
    path = "/vault/agent-cache"
  }
}

listener "tcp" {
  address     = "127.0.0.1:8100"
  tls_disable = true
}
Applications connect to the local agent at http://127.0.0.1:8100 instead of directly to the Vault server. The agent caches responses and handles token renewal automatically, reducing the number of requests that reach the Vault cluster.
Performance Standby Nodes (Enterprise)
Vault Enterprise supports performance standby nodes that can serve read requests directly, distributing read load across the entire cluster:
# Enterprise-only feature
# Standby nodes automatically serve reads when this is enabled
# No additional configuration needed beyond the standard HA setup
Batch Tokens for High-Volume Operations
For services that make many short-lived requests, batch tokens reduce storage pressure because they are not persisted:
# Create a batch token (not persisted to storage)
vault token create -type=batch -policy="webapp" -ttl="1h"
Batch tokens are ideal for Kubernetes pods that authenticate once and make a few API calls before terminating.
Vault Telemetry and Monitoring
Vault exposes metrics via a telemetry interface. Configure it to feed into your monitoring stack for proactive alerting.
Prometheus Configuration
# Already in vault.hcl from the Raft configuration section
telemetry {
  prometheus_retention_time = "30s"
  disable_hostname          = true
  usage_gauge_period        = "10m"
  maximum_gauge_cardinality = 500
}
# prometheus-scrape-config.yaml
- job_name: vault
  scheme: https
  metrics_path: /v1/sys/metrics
  bearer_token_file: /etc/prometheus/vault-metrics-token
  tls_config:
    ca_file: /etc/prometheus/vault-ca.crt
  params:
    format: ['prometheus']
  static_configs:
    - targets:
        - vault-1.internal:8200
        - vault-2.internal:8200
        - vault-3.internal:8200
  relabel_configs:
    - source_labels: [__address__]
      target_label: vault_node
Create a dedicated policy and token for Prometheus:
vault policy write prometheus-metrics - <<'EOF'
path "sys/metrics" {
  capabilities = ["read"]
}
EOF
vault token create -policy="prometheus-metrics" -period="768h" -orphan -display-name="prometheus"
Key Metrics and Alert Rules
| Metric | Warning Threshold | Critical Threshold | Meaning |
|---|---|---|---|
| vault.core.handle_request.duration | p99 over 200ms | p99 over 500ms | API latency increasing |
| vault.expire.num_leases | over 50,000 | over 100,000 | Active lease accumulation |
| vault.runtime.alloc_bytes | Sustained increase over 1h | over 80% of available RAM | Possible memory leak |
| vault.audit.log_response.duration | p99 over 50ms | p99 over 100ms | Audit device I/O bottleneck |
| vault.raft.leader.lastContact | over 200ms | over 500ms | Raft consensus degraded |
| vault.core.unsealed | N/A | equals 0 | Node is sealed |
| vault.raft.apply | Sustained 500/s increase | Sustained 1000/s increase | Abnormal write pressure |
| vault.raft.commitTime | p99 over 25ms | p99 over 100ms | Raft commit slow (disk I/O) |
| vault.token.count | over 50,000 | over 100,000 | Token accumulation |
| vault.barrier.get.duration | p99 over 10ms | p99 over 50ms | Storage backend slow |
Prometheus Alert Rules
# vault-alerts.yaml
# Note: Vault exports duration metrics as summaries in milliseconds
# (with quantile labels), not as _seconds histograms.
groups:
  - name: vault
    rules:
      - alert: VaultSealed
        expr: vault_core_unsealed == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Vault node is sealed"
          description: "Vault node {{ $labels.instance }} has been sealed for more than 1 minute."
      - alert: VaultHighLatency
        expr: vault_core_handle_request{quantile="0.99"} > 500
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Vault p99 latency is high"
          description: "p99 request latency on {{ $labels.instance }} is {{ $value }}ms"
      - alert: VaultLeaseAccumulation
        expr: vault_expire_num_leases > 50000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High number of active leases"
          description: "{{ $labels.instance }} has {{ $value }} active leases"
      - alert: VaultRaftLeaderLost
        expr: vault_raft_leader_lastContact{quantile="0.99"} > 500
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Raft leader contact degraded"
      - alert: VaultAuditDeviceSlow
        expr: vault_audit_log_response{quantile="0.99"} > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Audit device I/O is slow"
Health Check Script
#!/bin/bash
# vault-health-check.sh -- quick cluster health assessment
VAULT_NODES=("vault-1.internal" "vault-2.internal" "vault-3.internal")
CA_CERT="/opt/vault/tls/ca.crt"
echo "=== Vault Cluster Health Check ==="
echo "Timestamp: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
echo ""
for node in "${VAULT_NODES[@]}"; do
  RESPONSE=$(curl -s --cacert "$CA_CERT" "https://${node}:8200/v1/sys/health" -w '\n%{http_code}' 2>/dev/null)
  HTTP_CODE=$(echo "$RESPONSE" | tail -n1)
  BODY=$(echo "$RESPONSE" | sed '$d')
  case "$HTTP_CODE" in
    200) STATUS="ACTIVE (leader)" ;;
    429) STATUS="STANDBY" ;;
    472) STATUS="DR SECONDARY" ;;
    473) STATUS="PERFORMANCE STANDBY" ;;
    501) STATUS="UNINITIALIZED" ;;
    503) STATUS="SEALED" ;;
    *)   STATUS="UNREACHABLE (HTTP ${HTTP_CODE:-000})" ;;
  esac
  VERSION=$(echo "$BODY" | jq -r '.version // "unknown"' 2>/dev/null)
  echo "${node}: ${STATUS} (v${VERSION})"
done
echo ""
echo "=== Raft Peers ==="
vault operator raft list-peers 2>/dev/null || echo "Cannot list peers (not authenticated or not leader)"
echo ""
echo "=== Audit Devices ==="
vault audit list 2>/dev/null || echo "Cannot list audit devices (not authenticated)"
Backup and Restore
Raft Snapshots
Raft snapshots capture the entire Vault state, including all secrets, policies, auth configuration, and engine settings. Snapshots are encrypted with the barrier key, so they are safe to store in external systems, but you need a working Vault cluster (or the unseal keys) to restore them.
# Take a manual snapshot
vault operator raft snapshot save /backup/vault-snapshot-$(date +%Y%m%d-%H%M%S).snap
# Verify a snapshot (check its metadata)
vault operator raft snapshot inspect /backup/vault-snapshot-20260323-060000.snap
# Restore from a snapshot (WARNING: replaces all current data)
vault operator raft snapshot restore /backup/vault-snapshot-20260323-060000.snap
# Force restore (required if the cluster has different seal keys)
vault operator raft snapshot restore -force /backup/vault-snapshot-20260323-060000.snap
Automated Backup Script
#!/bin/bash
set -euo pipefail
# vault-backup.sh -- automated Vault backup
BACKUP_DIR="/backup/vault"
S3_BUCKET="s3://vault-backups/snapshots"
KMS_KEY_ID="arn:aws:kms:us-east-1:123456789012:key/backup-key-id"
RETENTION_DAYS=30
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
SNAPSHOT_FILE="${BACKUP_DIR}/vault-snapshot-${TIMESTAMP}.snap"
# Ensure backup directory exists
mkdir -p "$BACKUP_DIR"
# Take the snapshot
echo "$(date): Taking Vault snapshot..."
vault operator raft snapshot save "$SNAPSHOT_FILE"
SNAP_SIZE=$(stat --format="%s" "$SNAPSHOT_FILE")
echo "$(date): Snapshot saved (${SNAP_SIZE} bytes): ${SNAPSHOT_FILE}"
# Verify the snapshot
echo "$(date): Verifying snapshot..."
vault operator raft snapshot inspect "$SNAPSHOT_FILE" > /dev/null 2>&1
echo "$(date): Snapshot verification passed"
# Upload to S3 with server-side encryption
echo "$(date): Uploading to S3..."
aws s3 cp "$SNAPSHOT_FILE" \
"${S3_BUCKET}/${TIMESTAMP}.snap" \
--sse aws:kms \
--sse-kms-key-id "$KMS_KEY_ID" \
--metadata "vault-version=$(vault version -format=json | jq -r '.version'),node=$(hostname)"
# Create a "latest" pointer
aws s3 cp "$SNAPSHOT_FILE" \
"${S3_BUCKET}/latest.snap" \
--sse aws:kms \
--sse-kms-key-id "$KMS_KEY_ID"
# Clean up old local snapshots
echo "$(date): Cleaning up snapshots older than ${RETENTION_DAYS} days..."
find "$BACKUP_DIR" -name "vault-snapshot-*.snap" -mtime +${RETENTION_DAYS} -delete
# Clean up old S3 snapshots (keep 30 days)
aws s3api list-objects-v2 --bucket vault-backups --prefix snapshots/ \
  --query "Contents[?LastModified<='$(date -u -d "${RETENTION_DAYS} days ago" +%Y-%m-%dT%H:%M:%SZ)'].Key" \
  --output text | tr '\t' '\n' | while read -r key; do
    # --output text prints "None" when the query matches nothing
    [ -n "$key" ] && [ "$key" != "None" ] && aws s3 rm "s3://vault-backups/$key"
done
echo "$(date): Backup complete"
Schedule this with cron or a Kubernetes CronJob:
# Crontab entry: backup every 6 hours
0 */6 * * * /usr/local/bin/vault-backup.sh >> /var/log/vault/backup.log 2>&1
Backup Strategy Checklist
- Take snapshots every 4-6 hours at minimum (more frequently for high-change environments)
- Store snapshots in at least two geographic locations
- Encrypt snapshots at rest with a separate KMS key (not the Vault unseal key)
- Test restores quarterly to a non-production cluster
- Retain snapshots for at least 30 days (90 days for compliance-heavy environments)
- Alert on backup failures within one backup cycle
- Document the restore procedure and keep it accessible outside of Vault
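The "alert on backup failures" item can be enforced with a freshness check that fires when the newest snapshot is older than one backup cycle. A sketch using a temporary directory (the threshold and paths are illustrative):

```shell
#!/bin/sh
# Alert if the newest snapshot is older than MAX_AGE_HOURS.
BACKUP_DIR=/tmp/vault-backup-demo
MAX_AGE_HOURS=7     # one 6-hour cycle plus slack

mkdir -p "$BACKUP_DIR"
touch "$BACKUP_DIR/vault-snapshot-demo.snap"   # stand-in for a real snapshot

# Count snapshots newer than the threshold
FRESH=$(find "$BACKUP_DIR" -name '*.snap' -mmin "-$((MAX_AGE_HOURS * 60))" | wc -l)

if [ "$FRESH" -eq 0 ]; then
  echo "ALERT: no Vault snapshot newer than ${MAX_AGE_HOURS}h in $BACKUP_DIR"
else
  echo "OK: found $FRESH fresh snapshot(s)"
fi
```

Run this from cron or a monitoring agent on a node other than the one taking the backups, so a wedged backup host does not also silence the alert.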
Disaster Recovery
Recovery Scenarios
Scenario 1: Single node failure (most common)
With a three-node Raft cluster, losing one node does not affect availability. The remaining two nodes maintain quorum and elect a new leader if needed. Replace the failed node:
# If the failed node can be recovered, just restart it
sudo systemctl restart vault
# With auto-unseal, the node automatically unseals and rejoins
# If the node is permanently lost, remove it from the cluster
vault operator raft remove-peer vault-3
# Provision a new node with the same vault.hcl configuration
# (update node_id and api_addr)
# Start Vault -- it joins automatically via retry_join
sudo systemctl start vault
# With auto-unseal, it unseals and syncs data from the leader
# Verify the cluster
vault operator raft list-peers
Scenario 2: Quorum loss (majority of nodes lost)
If you lose two out of three nodes, the cluster cannot elect a leader and all operations stop. This requires a snapshot restore:
# 1. Stop all remaining Vault processes
sudo systemctl stop vault # on all nodes
# 2. On the node that will become the new leader, clean the data directory
sudo rm -rf /opt/vault/data/*
# 3. Start Vault on that node only
sudo systemctl start vault
# 4. Initialize a new single-node cluster
vault operator init -key-shares=1 -key-threshold=1
# Or with auto-unseal, it initializes with recovery keys
# 5. Unseal the node
vault operator unseal UNSEAL_KEY
# 6. Restore the latest snapshot
vault operator raft snapshot restore -force /backup/latest.snap
# 7. The restore brings back the snapshot's data and seal configuration,
#    so you must unseal again with the original cluster's keys
#    (automatic if the same KMS auto-unseal key is still configured)
# 8. Start the other nodes -- they join and sync via retry_join
sudo systemctl start vault # on vault-2
sudo systemctl start vault # on vault-3
# 9. Verify the cluster
vault operator raft list-peers
vault status
Scenario 3: Complete infrastructure loss
Everything is gone. Recover from off-site backup:
# 1. Pull the latest snapshot from S3
aws s3 cp s3://vault-backups/snapshots/latest.snap /tmp/vault-restore.snap
# 2. Provision new infrastructure (VMs, networking, TLS certs)
# 3. Install and configure Vault on the first node
# 4. Initialize and restore
vault operator init
vault operator unseal # provide keys
vault operator raft snapshot restore -force /tmp/vault-restore.snap
# 5. Join additional nodes
# Start vault-2 and vault-3 -- retry_join handles the rest
# 6. Verify everything
vault operator raft list-peers
vault secrets list
vault auth list
vault audit list
Scenario 4: KMS key unavailable (auto-unseal failure)
If the cloud KMS key used for auto-unseal becomes unavailable:
# 1. Check KMS connectivity
# For AWS:
aws kms describe-key --key-id "arn:aws:kms:us-east-1:123456789012:key/abcd-1234"
# 2. If the KMS key is temporarily unavailable, wait for the cloud provider to resolve
# Vault will automatically unseal once KMS access is restored
# 3. If access was revoked or the key is scheduled for deletion:
#    restore IAM access, or cancel the pending deletion -- AWS enforces a
#    7-30 day waiting period before a KMS key is actually destroyed
aws kms cancel-key-deletion --key-id "arn:aws:kms:us-east-1:123456789012:key/abcd-1234"
# 4. While the old key is still usable, migrate to a new KMS key:
#    update vault.hcl (mark the old seal stanza disabled = true, add the
#    new seal), restart Vault, and unseal with the -migrate flag
# 5. If the KMS key has been permanently destroyed, the master key cannot
#    be decrypted: recovery keys cannot unseal Vault, and snapshots from
#    this cluster are sealed under the same key. This scenario is
#    unrecoverable, which is why pending KMS key deletions must be
#    monitored and cancelled before the waiting period expires.
Disaster Recovery Replication (Enterprise)
Vault Enterprise supports DR replication, where a secondary cluster receives a continuous stream of data from the primary:
# On the primary cluster
vault write -f sys/replication/dr/primary/enable
# Generate a secondary activation token
vault write sys/replication/dr/primary/secondary-token id="dr-secondary"
# On the secondary cluster
vault write sys/replication/dr/secondary/enable token="SECONDARY_TOKEN_HERE"
# Verify replication status
vault read sys/replication/dr/status
The DR secondary is a hot standby. In a disaster, promote it to primary:
# Generate a DR operation token using recovery keys
vault operator generate-root -dr-token -init
vault operator generate-root -dr-token RECOVERY_KEY_1
vault operator generate-root -dr-token RECOVERY_KEY_2
vault operator generate-root -dr-token RECOVERY_KEY_3
# Promote the secondary
vault write sys/replication/dr/secondary/promote dr_operation_token="DR_TOKEN_HERE"
Security Hardening Checklist
TLS Configuration
listener "tcp" {
  address         = "0.0.0.0:8200"
  tls_cert_file   = "/opt/vault/tls/vault-full-chain.crt"
  tls_key_file    = "/opt/vault/tls/vault.key"
  tls_min_version = "tls13"

  # Recommended cipher suites for TLS 1.2 (if TLS 1.3 is not universally supported)
  # tls_cipher_suites = "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384"

  # Enable mTLS for client verification
  tls_require_and_verify_client_cert = false # Set to true for mTLS
  tls_client_ca_file                 = "/opt/vault/tls/client-ca.crt"

  # Disable HTTP/2 if not needed (reduces attack surface)
  # http2_enable = false
}
Network Policies (Kubernetes)
# vault-network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: vault-server
  namespace: vault
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: vault
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # Allow API traffic from application namespaces
    - from:
        - namespaceSelector:
            matchLabels:
              vault-access: "true"
      ports:
        - port: 8200
          protocol: TCP
    # Allow Raft cluster communication between Vault nodes
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: vault
      ports:
        - port: 8201
          protocol: TCP
    # Allow Prometheus scraping
    - from:
        - namespaceSelector:
            matchLabels:
              name: monitoring
      ports:
        - port: 8200
          protocol: TCP
  egress:
    # Allow Raft cluster communication
    - to:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: vault
      ports:
        - port: 8201
          protocol: TCP
    # Allow DNS resolution
    - to: []
      ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP
    # Allow KMS for auto-unseal
    - to: []
      ports:
        - port: 443
          protocol: TCP
    # Allow database connections for dynamic secrets
    - to: []
      ports:
        - port: 5432
          protocol: TCP
        - port: 3306
          protocol: TCP
Linux System Hardening
# Create the vault user with minimal permissions
sudo useradd --system --home /opt/vault --shell /usr/sbin/nologin vault
# Set file permissions
sudo chown -R vault:vault /opt/vault
sudo chmod 700 /opt/vault/data
sudo chmod 600 /opt/vault/tls/vault.key
sudo chmod 644 /opt/vault/tls/vault.crt
# Enable memory locking (prevent secrets from being swapped to disk)
sudo setcap cap_ipc_lock=+ep /usr/local/bin/vault
# Restrict SSH access to Vault nodes
# Use bastion host or VPN-only access
# Enable kernel hardening
echo "kernel.dmesg_restrict = 1" | sudo tee -a /etc/sysctl.d/vault.conf
echo "kernel.kptr_restrict = 2" | sudo tee -a /etc/sysctl.d/vault.conf
echo "net.ipv4.conf.all.send_redirects = 0" | sudo tee -a /etc/sysctl.d/vault.conf
sudo sysctl --system
Least-Privilege Practices
Create an admin policy that can manage Vault without having root access:
vault policy write vault-admin - <<'EOF'
# Manage secrets engines
path "sys/mounts/*" {
  capabilities = ["create", "read", "update", "delete", "list"]
}
path "sys/mounts" {
  capabilities = ["read", "list"]
}

# Manage policies
path "sys/policies/acl/*" {
  capabilities = ["create", "read", "update", "delete", "list"]
}
path "sys/policies/acl" {
  capabilities = ["list"]
}

# Manage auth methods
path "sys/auth/*" {
  capabilities = ["create", "read", "update", "delete", "list", "sudo"]
}
path "sys/auth" {
  capabilities = ["read", "list"]
}

# Manage audit devices
path "sys/audit/*" {
  capabilities = ["create", "read", "update", "delete", "list", "sudo"]
}
path "sys/audit" {
  capabilities = ["read", "list"]
}

# View system health and leader status
path "sys/health" {
  capabilities = ["read"]
}
path "sys/leader" {
  capabilities = ["read"]
}

# Manage leases
path "sys/leases/*" {
  capabilities = ["create", "read", "update", "delete", "list"]
}

# Read metrics
path "sys/metrics" {
  capabilities = ["read"]
}

# Raft operations
path "sys/storage/raft/*" {
  capabilities = ["read", "list"]
}

# DENY direct access to application secrets
# Admins manage infrastructure, not application data
path "secret/*" {
  capabilities = ["deny"]
}
path "database/*" {
  capabilities = ["deny"]
}
EOF
# Revoke the root token after creating admin access
vault token revoke ROOT_TOKEN_HERE
Root token generation should be used only for initial setup. After creating admin policies and users, revoke the root token. If you need root access later, generate a new root token using the recovery keys (or unseal keys):
# Generate a new root token (requires threshold of recovery/unseal keys)
vault operator generate-root -init
# Follow the prompts to provide recovery keys
Operational Runbooks
Runbook: Sealed Node
# 1. Check seal status
vault status
# 2. If using auto-unseal, check KMS connectivity
# AWS:
aws kms describe-key --key-id "YOUR_KMS_KEY_ARN"
# Azure:
az keyvault key show --vault-name "vault-unseal" --name "unseal-key"
# GCP:
gcloud kms keys describe vault-unseal-key --location global --keyring vault-keyring
# 3. Check Vault logs for seal-related errors
journalctl -u vault -n 100 --no-pager | grep -i "seal\|unseal\|kms"
# 4. If KMS is healthy, restart the Vault service
sudo systemctl restart vault
# 5. Monitor for automatic unseal (wait 30 seconds)
sleep 30 && vault status
# 6. If auto-unseal fails, check IAM permissions
# Verify the Vault process has access to the KMS key
# 7. Last resort: manual unseal with recovery keys (if migrated from Shamir)
# This should not be needed with auto-unseal unless KMS is permanently unavailable
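The `vault status` checks in this runbook can also be scripted against the `/v1/sys/health` endpoint, which encodes seal and leadership state in its HTTP status code. A small helper mapping Vault's documented default codes to a readable state; the curl that would produce the code needs a live cluster, so it is shown as a comment:

```shell
# Map /v1/sys/health HTTP status codes (Vault's documented defaults) to state
health_state() {
  case "$1" in
    200) echo "active (unsealed)" ;;
    429) echo "standby (unsealed)" ;;
    472) echo "disaster recovery secondary" ;;
    473) echo "performance standby" ;;
    501) echo "not initialized" ;;
    503) echo "sealed" ;;
    *)   echo "unknown" ;;
  esac
}
# On a live node the code would come from:
#   curl -sk -o /dev/null -w '%{http_code}' https://vault-1.internal:8200/v1/sys/health
health_state 503
```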
Runbook: Leader Election Failure
# 1. Check Raft cluster status
vault operator raft list-peers
# 2. Check if quorum is maintained (need majority of nodes)
# 3-node cluster needs 2 nodes
# 5-node cluster needs 3 nodes
# 3. If a node is unreachable, check its status
curl -sk https://vault-3.internal:8200/v1/sys/health
# 4. If the node is permanently down, remove it from the cluster
vault operator raft remove-peer vault-3
# 5. If the cluster lost quorum entirely:
# Follow Scenario 2 from the Disaster Recovery section
# 6. Verify cluster health after resolution
vault operator raft list-peers
vault operator raft autopilot state
vault status
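The quorum arithmetic from step 2 generalizes: an N-node Raft cluster needs floor(N/2)+1 voting members to commit writes, and therefore tolerates floor((N-1)/2) node failures. A quick sanity check:

```shell
# Majority quorum for an N-node Raft cluster, and how many failures it survives
quorum()    { echo $(( $1 / 2 + 1 )); }
tolerated() { echo $(( ($1 - 1) / 2 )); }

for n in 3 5 7; do
  echo "nodes=$n quorum=$(quorum "$n") tolerated_failures=$(tolerated "$n")"
done
```

This is why even node counts buy nothing: a 4-node cluster still needs 3 nodes for quorum and tolerates only 1 failure, same as 3 nodes.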
Runbook: High Latency
# 1. Check system-level metrics
top -bn1 | head -20
iostat -x 1 3
free -h
# 2. Check Vault metrics
curl -s --header "X-Vault-Token: $VAULT_TOKEN" \
"${VAULT_ADDR}/v1/sys/metrics?format=json" | \
jq '.Gauges[] | select(.Name | contains("runtime"))'
# 3. Check active lease count (high lease count causes performance degradation)
vault read -format=json sys/metrics | \
jq '.data.Gauges[] | select(.Name == "vault.expire.num_leases")'
# 4. If lease count is high, identify and revoke unnecessary leases
vault list -format=json sys/leases/lookup/database/ 2>/dev/null
vault lease revoke -prefix database/creds/unused-role/
# 5. Check audit device performance (slow disk causes request delays)
ls -la /var/log/vault/audit.log
df -h /var/log/
iostat -x 1 1 | grep -E "Device|$(findmnt -n -o SOURCE /var/log | xargs basename)"
# 6. If audit log file is too large, rotate it
sudo logrotate -f /etc/logrotate.d/vault
# 7. Check Raft commit times (indicates storage backend performance)
curl -s --header "X-Vault-Token: $VAULT_TOKEN" \
"${VAULT_ADDR}/v1/sys/metrics?format=json" | \
jq '.Summaries[] | select(.Name | contains("raft.commitTime"))'
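Step 3's lease-count check can be wrapped in a small alerting script. A self-contained sketch against an abbreviated sample of the metrics JSON shape; the 100000 threshold is a local convention for illustration, not a Vault default:

```shell
# Abbreviated sample of Vault's sys/metrics JSON shape, to demo the filter
METRICS='{"Gauges":[{"Name":"vault.expire.num_leases","Value":125000}]}'
LEASES=$(echo "$METRICS" | jq '.Gauges[] | select(.Name == "vault.expire.num_leases") | .Value')
# Alert threshold is an assumption -- tune it to your cluster's baseline
if [ "$LEASES" -gt 100000 ]; then
  echo "WARN: ${LEASES} active leases -- investigate lease churn"
fi
```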
Runbook: Emergency Seal
If you suspect a breach or unauthorized access:
# 1. Seal Vault immediately -- this stops ALL secret access
vault operator seal
# 2. This is a deliberate outage -- inform your team immediately
# 3. Investigate the audit logs (from backup copies, since Vault is sealed)
# Copy audit log to investigation workstation
cp /var/log/vault/audit.log /tmp/investigation/
# 4. Search for suspicious activity
cat /tmp/investigation/audit.log | \
jq 'select(.auth.display_name == "suspicious-identity")' > suspicious.json
cat /tmp/investigation/audit.log | \
jq 'select(.request.remote_address | startswith("unknown-range"))' >> suspicious.json
# 5. Identify compromised tokens and prepare revocation commands
cat suspicious.json | jq -r '.auth.accessor' | sort -u > compromised-accessors.txt
# 6. Unseal when investigation is complete and remediation is planned
# With auto-unseal, restarting the service triggers the unseal:
sudo systemctl restart vault
# With a Shamir seal, provide the threshold of unseal keys:
vault operator unseal UNSEAL_KEY_1
vault operator unseal UNSEAL_KEY_2
vault operator unseal UNSEAL_KEY_3
# 7. Revoke compromised tokens
while read accessor; do
vault token revoke -accessor "$accessor"
done < compromised-accessors.txt
# 8. Revoke leases associated with compromised identities
vault lease revoke -prefix COMPROMISED_PATH/
# 9. Rotate any secrets that may have been exposed
vault write -f database/rotate-root/myapp-db
# 10. Document the incident and update runbooks
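The filters in step 4 depend on Vault's audit entry shape. A self-contained sketch against two abbreviated sample entries, extracting accessors seen from outside a trusted 10.0.0.0/8 range; the range, display names, and accessor values are illustrative:

```shell
# Two abbreviated sample audit entries (fields follow Vault's audit format;
# values are illustrative)
SAMPLE="$(mktemp)"
cat > "$SAMPLE" <<'EOF'
{"type":"request","auth":{"display_name":"token-ci","accessor":"hmac-acc1"},"request":{"remote_address":"10.0.5.14","path":"secret/data/app"}}
{"type":"request","auth":{"display_name":"suspicious-identity","accessor":"hmac-acc2"},"request":{"remote_address":"203.0.113.9","path":"sys/policies/acl/admin"}}
EOF
# Unique accessors used from outside the trusted 10.0.0.0/8 range
jq -r 'select(.request.remote_address | startswith("10.") | not) | .auth.accessor' \
  "$SAMPLE" | sort -u
```

Note that audit logs store token accessors as HMACs; revocation by accessor in step 7 works because `vault token revoke -accessor` matches the accessor value, not the token itself.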
Runbook: Vault Upgrade
# 1. Read the changelog and upgrade notes for the target version
# 2. Test the upgrade in a non-production environment first
# 3. Take a snapshot before upgrading
vault operator raft snapshot save /backup/pre-upgrade-$(date +%Y%m%d).snap
# 4. Upgrade standby nodes first
# On vault-2 (standby):
sudo systemctl stop vault
sudo cp /usr/local/bin/vault /usr/local/bin/vault.bak
sudo cp /tmp/vault-new-version /usr/local/bin/vault
sudo systemctl start vault
# Verify the standby node is healthy
curl -sk https://vault-2.internal:8200/v1/sys/health
# 5. Repeat for vault-3 (standby)
# 6. Step down the leader to trigger failover to an upgraded node
vault operator step-down
# 7. Upgrade the old leader (now a standby)
# On vault-1:
sudo systemctl stop vault
sudo cp /usr/local/bin/vault /usr/local/bin/vault.bak
sudo cp /tmp/vault-new-version /usr/local/bin/vault
sudo systemctl start vault
# 8. Verify all nodes are running the new version
for node in vault-{1,2,3}.internal; do
echo "${node}: $(curl -sk https://${node}:8200/v1/sys/health | jq -r '.version')"
done
# 9. Verify cluster health
vault operator raft list-peers
vault status
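Step 8's version listing can be turned into a pass/fail check. A sketch using GNU `sort -V` for semantic-version ordering over sample data; in practice the versions would come from the health-endpoint loop in step 8:

```shell
# Sample data -- in practice these come from the sys/health calls in step 8
TARGET="1.15.2"
NODE_VERSIONS="1.15.2
1.15.2
1.14.8"
# GNU sort -V orders versions numerically; the oldest line must equal TARGET
OLDEST=$(printf '%s\n%s\n' "$TARGET" "$NODE_VERSIONS" | sort -V | head -n 1)
if [ "$OLDEST" != "$TARGET" ]; then
  echo "MIXED: at least one node is below ${TARGET}"
fi
```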
Summary
Running Vault in production requires planning across several dimensions:
- Storage: Raft integrated storage for simplicity and performance
- Availability: a three- or five-node cluster for quorum fault tolerance
- Unsealing: cloud KMS auto-unseal for zero-touch recovery
- Observability: dual audit devices plus Prometheus telemetry
- Backup: automated Raft snapshots stored off-site
- Security: TLS 1.3, network policies, least-privilege policies, and revoked root tokens
Treat your Vault cluster as the most critical piece of infrastructure you operate: if Vault goes down, every system that depends on it for credentials follows. Build the automation, monitoring, and runbooks before you need them, because the middle of an incident is the wrong time to figure out your recovery procedure. Start with a three-node Raft cluster, enable auto-unseal with cloud KMS, configure at least two audit devices, automate snapshots to run every six hours, deploy Prometheus alerting on key metrics, and keep your runbooks in a location that does not require Vault access to read them.