
Vault in Production: HA, Auto-Unseal, and Disaster Recovery

Amara Okafor · 28 min read

Production Architecture Overview

A production Vault deployment looks nothing like a dev server. You need high availability so that a single node failure does not take down secret access across your entire infrastructure. You need persistent storage so that secrets survive restarts and node replacements. You need auto-unseal so that nodes can recover without human intervention at 3 AM. You need audit logging so that every secret access is recorded for security investigations and compliance requirements. And you need monitoring so that you know about problems before they become outages.

The cost of getting this wrong is high. If Vault goes down, every system that depends on it for credentials stops working. Database connections fail, API calls return authentication errors, TLS certificates cannot be renewed, and CI/CD pipelines halt. Vault is not just another service in your infrastructure; it is a dependency of nearly every other service.

Here is the target architecture for a robust production deployment:

                    +-------------------+
                    |   Load Balancer   |
                    |  (L4 TCP/TLS)     |
                    +---------+---------+
                              |
                 +------------+------------+
                 |            |            |
          +------+-----+ +----+----+ +-----+------+
          |  Vault 1   | | Vault 2 | |  Vault 3   |
          |  (Leader)  | | (Stby)  | |  (Stby)    |
          +------+-----+ +----+----+ +-----+------+
                 |            |            |
                 +------------+------------+
                              |
                      +-------+-------+
                      |     Raft      |
                      |    Storage    |
                      |  (Local SSD)  |
                      +-------+-------+
                              |
                      +-------+-------+
                      |   Cloud KMS   |
                      | (Auto-Unseal) |
                      +-------+-------+
                              |
             +----------------+----------------+
             |                |                |
      +------+------+  +------+------+  +------+------+
      | Audit Log 1 |  | Audit Log 2 |  | Telemetry   |
      | (Local File)|  | (Syslog)    |  | (Prometheus)|
      +-------------+  +-------------+  +-------------+

Sizing Recommendations

| Component        | Small (dev/staging) | Medium (production) | Large (enterprise) |
|------------------|---------------------|---------------------|--------------------|
| Nodes            | 3                   | 5                   | 5-7                |
| CPU per node     | 2 cores             | 4 cores             | 8 cores            |
| RAM per node     | 4 GB                | 8 GB                | 16 GB              |
| Storage per node | 25 GB SSD           | 50 GB SSD           | 100 GB NVMe        |
| Network          | Standard            | Low latency         | Dedicated VLAN     |
| Active leases    | Up to 10,000        | Up to 100,000       | Up to 500,000      |

Vault is CPU and I/O bound during normal operations. The encryption and decryption of secrets, the Raft consensus protocol, and audit log writes all require consistent low-latency disk I/O. Always use SSDs or NVMe drives for Vault storage and audit logs.
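
A quick way to sanity-check a candidate volume before committing to it is a synchronous-write probe; a minimal sketch (TARGET defaults to /tmp for illustration, point it at the intended data mount):

```shell
# Synchronous 4 KB writes approximate the flush pattern of Raft log
# appends and audit writes. TARGET defaults to /tmp for illustration;
# set it to the mount that will back /opt/vault/data.
TARGET="${TARGET:-/tmp}"
dd if=/dev/zero of="$TARGET/vault-io-probe" bs=4k count=1000 oflag=dsync
rm "$TARGET/vault-io-probe"
```

An SSD typically sustains several MB/s or more on this pattern; sub-MB/s throughput here usually translates into slow Raft commits and audit latency. A tool like fio gives detailed latency percentiles if it is available.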

Raft Integrated Storage

Vault supports several storage backends, but Raft integrated storage is now the recommended option for most deployments. It eliminates the need for a separate Consul cluster, reducing operational complexity significantly. With Raft, Vault manages its own distributed consensus, replication, and leader election.

Why Raft Over Consul

| Factor | Raft Integrated Storage | Consul Backend |
|---|---|---|
| Operational complexity | Lower (no separate cluster) | Higher (manage Consul too) |
| Infrastructure cost | Vault nodes only | Vault + Consul nodes (6+ total) |
| Network requirements | Vault-to-Vault only | Vault-to-Consul + Consul-to-Consul |
| Snapshot/backup | Built-in vault operator raft snapshot | Separate Consul snapshot process |
| Performance | Direct local storage | Network hop to Consul on every write |
| HA support | Yes (built-in Raft leader election) | Yes (via Consul session locking) |
| Recommended by HashiCorp | Yes (since Vault 1.4+) | Still supported but not preferred |
| Debugging | Single system to troubleshoot | Two distributed systems to troubleshoot |

Full Raft Configuration

# /etc/vault.d/vault.hcl -- Node 1 (initial leader)

# Raft storage backend
storage "raft" {
  path    = "/opt/vault/data"
  node_id = "vault-1"

  # Performance tuning
  performance_multiplier = 1

  # Autopilot configuration for automatic cluster management
  autopilot_reconcile_interval = "10s"

  # Peer discovery: list all other nodes
  retry_join {
    leader_api_addr         = "https://vault-2.internal:8200"
    leader_ca_cert_file     = "/opt/vault/tls/ca.crt"
    leader_client_cert_file = "/opt/vault/tls/vault.crt"
    leader_client_key_file  = "/opt/vault/tls/vault.key"
  }
  retry_join {
    leader_api_addr         = "https://vault-3.internal:8200"
    leader_ca_cert_file     = "/opt/vault/tls/ca.crt"
    leader_client_cert_file = "/opt/vault/tls/vault.crt"
    leader_client_key_file  = "/opt/vault/tls/vault.key"
  }
}

# API listener with TLS
listener "tcp" {
  address         = "0.0.0.0:8200"
  cluster_address = "0.0.0.0:8201"

  # TLS configuration
  tls_cert_file   = "/opt/vault/tls/vault.crt"
  tls_key_file    = "/opt/vault/tls/vault.key"
  tls_min_version = "tls13"

  # Client certificate verification (for mTLS)
  tls_require_and_verify_client_cert = false
  tls_client_ca_file                 = "/opt/vault/tls/ca.crt"

  # Request handling
  max_request_size    = 33554432  # 32 MB
  max_request_duration = "90s"

  # Telemetry
  telemetry {
    unauthenticated_metrics_access = true
  }
}

# Cluster communication addresses
api_addr     = "https://vault-1.internal:8200"
cluster_addr = "https://vault-1.internal:8201"
cluster_name = "vault-prod"

# Web UI
ui = true

# Logging
log_level = "info"
log_file  = "/var/log/vault/vault.log"
log_rotate_duration = "24h"
log_rotate_max_files = 30

# Telemetry for monitoring
telemetry {
  prometheus_retention_time = "30s"
  disable_hostname          = true
  usage_gauge_period        = "10m"
  maximum_gauge_cardinality = 500
}

# Disable memory lock only if the system capability cannot be set
# disable_mlock = true

# Default and maximum lease TTLs
default_lease_ttl = "1h"
max_lease_ttl     = "768h"

Each node gets the same configuration with its own node_id and api_addr. The retry_join blocks tell each node where to find the other members. TLS is configured for both the API listener and the Raft cluster communication.
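
Since only a few values differ per node, the per-node files can be stamped out from a shared template; a minimal sketch (template path and the __NODE__ placeholder are arbitrary):

```shell
# Generate per-node config fragments from one template. node_id,
# api_addr, and cluster_addr all derive from the node name.
cat > /tmp/vault-node.tpl <<'EOF'
storage "raft" {
  path    = "/opt/vault/data"
  node_id = "__NODE__"
}
api_addr     = "https://__NODE__.internal:8200"
cluster_addr = "https://__NODE__.internal:8201"
EOF

for node in vault-1 vault-2 vault-3; do
  sed "s/__NODE__/${node}/g" /tmp/vault-node.tpl > "/tmp/${node}.hcl"
done
```

In practice the same substitution is usually handled by Terraform templatefile, Ansible, or a Helm chart rather than sed, but the principle is identical: one template, one differing identifier.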

Cluster Initialization and Formation

# Initialize the first node
export VAULT_ADDR='https://vault-1.internal:8200'
export VAULT_CACERT='/opt/vault/tls/ca.crt'

vault operator init -key-shares=5 -key-threshold=3

# Save the output securely (5 unseal keys + 1 root token)

# Unseal the first node
vault operator unseal UNSEAL_KEY_1
vault operator unseal UNSEAL_KEY_2
vault operator unseal UNSEAL_KEY_3

# The other nodes automatically join via retry_join
# They need to be unsealed as well
export VAULT_ADDR='https://vault-2.internal:8200'
vault operator unseal UNSEAL_KEY_1
vault operator unseal UNSEAL_KEY_2
vault operator unseal UNSEAL_KEY_3

export VAULT_ADDR='https://vault-3.internal:8200'
vault operator unseal UNSEAL_KEY_1
vault operator unseal UNSEAL_KEY_2
vault operator unseal UNSEAL_KEY_3

# Verify the Raft cluster
export VAULT_ADDR='https://vault-1.internal:8200'
vault operator raft list-peers

Expected output:

Node       Address                     State       Voter
----       -------                     -----       -----
vault-1    vault-1.internal:8201       leader      true
vault-2    vault-2.internal:8201       follower    true
vault-3    vault-3.internal:8201       follower    true
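
A small helper can turn that output into a pass/fail check for automation; a sketch assuming the tabular format shown above:

```shell
# check_peers: read `vault operator raft list-peers` output on stdin
# and verify there is exactly one leader and every peer is a voter.
check_peers() {
  awk 'NR > 2 && NF >= 4 {
         if ($3 == "leader") leaders++
         if ($4 != "true")   nonvoters++
       }
       END {
         if (leaders != 1)  { print "WARN: expected 1 leader, found " leaders+0; exit 1 }
         if (nonvoters > 0) { print "WARN: " nonvoters " non-voting peer(s)"; exit 1 }
         print "cluster looks healthy"
       }'
}

# Usage: vault operator raft list-peers | check_peers
```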

Raft Autopilot

Vault includes an autopilot feature for Raft that automates cluster management tasks:

# Check autopilot status
vault operator raft autopilot get-config

# Configure autopilot
vault operator raft autopilot set-config \
  cleanup-dead-servers=true \
  dead-server-last-contact-threshold=24h \
  min-quorum=3 \
  server-stabilization-time=10s

# View cluster health from autopilot's perspective
vault operator raft autopilot state

Autopilot can automatically remove dead servers from the cluster, promoting new voters when a node has been unreachable for longer than the threshold. This reduces manual intervention for common failure scenarios.

Auto-Unseal

Manual unsealing is acceptable for a single development instance, but it is operationally untenable in production. If a node restarts at 2 AM, you do not want to page three operators to provide unseal keys. Auto-unseal delegates the unseal operation to a trusted cloud Key Management Service (KMS). With auto-unseal, the Vault master key is encrypted by the KMS key and stored alongside the encrypted data. When Vault starts, it calls the KMS API to decrypt the master key automatically.

AWS KMS Auto-Unseal

# Add to vault.hcl
seal "awskms" {
  region     = "us-east-1"
  kms_key_id = "arn:aws:kms:us-east-1:123456789012:key/abcd-1234-efgh-5678"

  # Optional: use a specific AWS profile or endpoint
  # access_key = "..." # Prefer IAM roles instead
  # secret_key = "..."
  # endpoint   = "https://kms.us-east-1.amazonaws.com"
}

The Vault server's IAM role needs these permissions on the KMS key:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "kms:Encrypt",
        "kms:Decrypt",
        "kms:DescribeKey"
      ],
      "Resource": "arn:aws:kms:us-east-1:123456789012:key/abcd-1234-efgh-5678"
    }
  ]
}

Best practices for the KMS key:

  • Use a dedicated KMS key for Vault unsealing (do not share with other services)
  • Enable key rotation on the KMS key (AWS handles this transparently)
  • Restrict access to the KMS key to only the Vault IAM role
  • Enable CloudTrail logging on the KMS key for audit purposes
  • Consider multi-region KMS keys if Vault spans regions
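
The access-restriction bullet can be enforced in the KMS key policy itself, not only through IAM; a sketch with placeholder account ID and role names:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "EnableIAMPolicies",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::123456789012:root" },
      "Action": "kms:*",
      "Resource": "*"
    },
    {
      "Sid": "VaultUnsealOnly",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::123456789012:role/vault-server" },
      "Action": ["kms:Encrypt", "kms:Decrypt", "kms:DescribeKey"],
      "Resource": "*"
    }
  ]
}
```

The root-principal statement delegates further control to IAM policies like the one shown earlier; omit it for a stricter key-policy-only model, but be aware that locking the key policy down too far can make the key unmanageable.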

Azure Key Vault Auto-Unseal

seal "azurekeyvault" {
  tenant_id      = "your-tenant-id"
  vault_name     = "vault-unseal-keyvault"
  key_name       = "vault-unseal-key"
  # client_id    = "..." # Use managed identity instead
  # client_secret = "..."
  # environment  = "AZUREPUBLICCLOUD"
}

For Azure, use a Managed Identity assigned to the Vault VM or AKS node pool. On an RBAC-enabled Key Vault, the identity needs role assignments that let it read the key and wrap/unwrap with it:

  • Key Vault Crypto User (for the WrapKey/UnwrapKey operations Vault performs during seal and unseal)
  • Key Vault Reader (to read Key Vault and key metadata)

GCP Cloud KMS Auto-Unseal

seal "gcpckms" {
  project    = "my-project"
  region     = "global"
  key_ring   = "vault-keyring"
  crypto_key = "vault-unseal-key"
  # credentials = "/path/to/service-account.json" # Use workload identity instead
}

HashiCorp Cloud Platform (HCP) Transit Auto-Unseal

You can also use another Vault cluster (or HCP Vault) for auto-unseal via the Transit secret engine:

seal "transit" {
  address         = "https://hcp-vault.example.com:8200"
  token           = "hvs.transit-unseal-token"
  disable_renewal = false
  key_name        = "autounseal"
  mount_path      = "transit/"
  # tls_ca_cert   = "/path/to/ca.crt"
}

Migrating from Shamir to Auto-Unseal

If you have an existing Vault cluster using Shamir keys and want to migrate to auto-unseal:

# 1. Add the seal stanza to vault.hcl on all nodes
# 2. Stop the Vault service on all nodes
sudo systemctl stop vault

# 3. Start Vault on the leader node first
sudo systemctl start vault

# 4. The node starts in a migration state
# Provide the old Shamir keys with the -migrate flag
vault operator unseal -migrate SHAMIR_KEY_1
vault operator unseal -migrate SHAMIR_KEY_2
vault operator unseal -migrate SHAMIR_KEY_3

# 5. Vault migrates the seal and generates recovery keys
# Save the recovery keys securely

# 6. Start the remaining nodes -- they auto-unseal via KMS
sudo systemctl start vault  # on vault-2
sudo systemctl start vault  # on vault-3

# 7. Verify the migration
vault status
# Seal Type should show "awskms" (or your chosen KMS)

After migration, Vault uses the KMS key for unsealing. The Shamir keys are replaced by recovery keys. Recovery keys cannot unseal Vault but are needed for certain administrative operations like generating a root token.

Audit Devices

Audit logging is non-negotiable in production. Every request and response is logged, including who accessed what secret, when, and from where. Sensitive values in the audit log are HMAC-SHA256 hashed with a salt that Vault manages per audit device, so the log itself does not contain plaintext secrets; but because the same input always produces the same HMAC, you can still tell whether the same secret value appeared in two entries.
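
That correlation property is just the determinism of HMAC, which is easy to demonstrate with openssl (the salt here is illustrative; Vault's real salt is generated per audit device and never exposed):

```shell
# Same value + same salt => same HMAC: identical secrets can be
# correlated across audit entries without revealing the plaintext.
SALT="per-device-salt"   # illustrative only
a=$(printf '%s' "s3cr3t" | openssl dgst -sha256 -hmac "$SALT" | awk '{print $NF}')
b=$(printf '%s' "s3cr3t" | openssl dgst -sha256 -hmac "$SALT" | awk '{print $NF}')
c=$(printf '%s' "other"  | openssl dgst -sha256 -hmac "$SALT" | awk '{print $NF}')
[ "$a" = "$b" ]  && echo "same secret, same HMAC"
[ "$a" != "$c" ] && echo "different secret, different HMAC"
```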

Enabling Multiple Audit Devices

Always enable at least two audit devices for redundancy. If all audit devices fail (e.g., disk full for a file audit device), Vault stops responding to all requests. This is a security feature, not a bug. Vault will not serve secrets if it cannot log the access.

# Primary: file-based audit log
vault audit enable -path=file-primary file \
  file_path=/var/log/vault/audit.log \
  log_raw=false \
  hmac_accessor=true \
  mode=0600

# Secondary: syslog for centralized log collection
vault audit enable -path=syslog-backup syslog \
  tag="vault" \
  facility="AUTH"

# Tertiary: socket for real-time log streaming
vault audit enable -path=socket-elk socket \
  address="logstash.internal:9200" \
  socket_type="tcp"

# Verify all audit devices are enabled
vault audit list -detailed

Audit Log Format and Analysis

Each audit entry is a JSON object containing request and response details:

{
  "time": "2026-03-23T10:15:30.123456Z",
  "type": "response",
  "auth": {
    "client_token": "hmac-sha256:abc123def456...",
    "accessor": "hmac-sha256:789ghi012jkl...",
    "display_name": "kubernetes-production-webapp-sa",
    "policies": ["webapp", "default"],
    "token_policies": ["webapp", "default"],
    "metadata": {
      "role": "webapp",
      "service_account_name": "webapp-sa",
      "service_account_namespace": "production",
      "service_account_uid": "12345-abcde-67890"
    },
    "entity_id": "entity-uuid-here",
    "token_type": "service",
    "token_ttl": 3600,
    "token_issue_time": "2026-03-23T09:15:30Z"
  },
  "request": {
    "id": "request-uuid-here",
    "operation": "read",
    "path": "secret/data/webapp/production",
    "remote_address": "10.0.1.45",
    "remote_port": 49152,
    "namespace": {
      "id": "root"
    },
    "wrap_ttl": 0
  },
  "response": {
    "mount_type": "kv",
    "mount_accessor": "kv_abc123",
    "mount_is_external_plugin": false,
    "mount_running_version": "v0.16.1+builtin"
  }
}

This tells you exactly which service account, from which pod IP, read which secret, at what time, and which policies authorized the access.

Audit Log Querying

# Find all accesses to a specific secret path
cat /var/log/vault/audit.log | \
  jq 'select(.request.path == "secret/data/webapp/production")'

# Find all denied requests (for debugging policy issues)
# Note: errors appear in the top-level "error" field of an audit entry
cat /var/log/vault/audit.log | \
  jq 'select(.error != null) | {time: .time, path: .request.path, error: .error}'

# Find all requests from a specific IP
cat /var/log/vault/audit.log | \
  jq 'select(.request.remote_address == "10.0.1.45")'

# Count requests per path in the last hour
cat /var/log/vault/audit.log | \
  jq -r 'select(.type == "request") | .request.path' | sort | uniq -c | sort -rn | head -20

# Find all root token usage (should be zero in normal operations)
cat /var/log/vault/audit.log | \
  jq 'select(.auth.policies | index("root"))'
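
The root-token query above is worth running on a schedule rather than ad hoc; a small wrapper sketch (the log path and alert destination are assumptions):

```shell
# scan_root_usage: print time, operation, path, and source address for
# every audit entry authorized by the root policy. Normal operations
# should produce zero output lines.
scan_root_usage() {
  jq -r 'select((.auth.policies // []) | index("root"))
         | [.time, .request.operation, .request.path,
            (.request.remote_address // "-")]
         | @tsv' "${1:-/var/log/vault/audit.log}"
}

# Example: scan_root_usage | mail -s "root token used" oncall@example.com
```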

Audit Log Rotation

# Configure logrotate for Vault audit logs
# /etc/logrotate.d/vault
cat > /etc/logrotate.d/vault <<'LOGROTATE'
/var/log/vault/audit.log {
    daily
    rotate 90
    compress
    delaycompress
    missingok
    notifempty
    create 0600 vault vault
    postrotate
        /usr/bin/kill -HUP $(cat /var/run/vault.pid 2>/dev/null) 2>/dev/null || true
    endscript
}
LOGROTATE

For high-volume environments, consider streaming audit logs directly to a centralized logging system (ELK, Splunk, Datadog) rather than writing to local files. This reduces local disk pressure and provides better querying capabilities.

Performance Standby Nodes and Client-Side Caching

In open-source Vault with Raft, standby nodes forward all requests to the leader. This means the leader handles 100% of the request load. To reduce pressure on the leader, use client-side caching through Vault Agent.

Vault Agent Caching

Deploy Vault Agent as a sidecar or DaemonSet that caches responses locally:

# vault-agent-cache.hcl
auto_auth {
  method "kubernetes" {
    mount_path = "auth/kubernetes"
    config = {
      role = "webapp"
    }
  }

  sink "file" {
    config = {
      path = "/home/vault/.vault-token"
    }
  }
}

cache {
  use_auto_auth_token = true
  persist = {
    type = "kubernetes"
    path = "/vault/agent-cache"
  }
}

listener "tcp" {
  address     = "127.0.0.1:8100"
  tls_disable = true
}

Applications connect to the local agent at http://127.0.0.1:8100 instead of directly to the Vault server. The agent caches responses and handles token renewal automatically, reducing the number of requests that reach the Vault cluster.

Performance Standby Nodes (Enterprise)

Vault Enterprise supports performance standby nodes that can serve read requests directly, distributing read load across the entire cluster:

# Enterprise-only feature
# Standby nodes automatically serve reads when this is enabled
# No additional configuration needed beyond the standard HA setup

Batch Tokens for High-Volume Operations

For services that make many short-lived requests, batch tokens reduce storage pressure because they are not persisted:

# Create a batch token (not persisted to storage)
vault token create -type=batch -policy="webapp" -ttl="1h"

Batch tokens are ideal for Kubernetes pods that authenticate once and make a few API calls before terminating.

Vault Telemetry and Monitoring

Vault exposes metrics via a telemetry interface. Configure it to feed into your monitoring stack for proactive alerting.

Prometheus Configuration

# Already in vault.hcl from the Raft configuration section
telemetry {
  prometheus_retention_time = "30s"
  disable_hostname          = true
  usage_gauge_period        = "10m"
  maximum_gauge_cardinality = 500
}
# prometheus-scrape-config.yaml
- job_name: vault
  scheme: https
  tls_config:
    ca_file: /etc/prometheus/vault-ca.crt
  bearer_token_file: /etc/prometheus/vault-metrics-token
  metrics_path: /v1/sys/metrics
  params:
    format: ['prometheus']
  static_configs:
    - targets:
        - vault-1.internal:8200
        - vault-2.internal:8200
        - vault-3.internal:8200
  relabel_configs:
    - source_labels: [__address__]
      target_label: vault_node

Create a dedicated policy and token for Prometheus:

vault policy write prometheus-metrics - <<'EOF'
path "sys/metrics" {
  capabilities = ["read"]
}
EOF

vault token create -policy="prometheus-metrics" -period="768h" -orphan -display-name="prometheus"

Key Metrics and Alert Rules

| Metric | Warning Threshold | Critical Threshold | Meaning |
|---|---|---|---|
| vault.core.handle_request.duration | p99 over 200ms | p99 over 500ms | API latency increasing |
| vault.expire.num_leases | over 50,000 | over 100,000 | Active lease accumulation |
| vault.runtime.alloc_bytes | Sustained increase over 1h | over 80% of available RAM | Possible memory leak |
| vault.audit.log_response.duration | p99 over 50ms | p99 over 100ms | Audit device I/O bottleneck |
| vault.raft.leader.lastContact | over 200ms | over 500ms | Raft consensus degraded |
| vault.core.unsealed | N/A | equals 0 | Node is sealed |
| vault.raft.apply | Sustained 500/s increase | Sustained 1000/s increase | Abnormal write pressure |
| vault.raft.commitTime | p99 over 25ms | p99 over 100ms | Raft commit slow (disk I/O) |
| vault.token.count | over 50,000 | over 100,000 | Token accumulation |
| vault.barrier.get.duration | p99 over 10ms | p99 over 50ms | Storage backend slow |

Prometheus Alert Rules

# vault-alerts.yaml
groups:
  - name: vault
    rules:
      - alert: VaultSealed
        expr: vault_core_unsealed == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Vault node is sealed"
          description: "Vault node {{ $labels.instance }} has been sealed for more than 1 minute."

      - alert: VaultHighLatency
        # Vault exports timing metrics as summaries with quantile
        # labels, with values in milliseconds
        expr: vault_core_handle_request{quantile="0.99"} > 500
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Vault p99 latency is high"
          description: "p99 request latency on {{ $labels.instance }} is {{ $value }}ms"

      - alert: VaultLeaseAccumulation
        expr: vault_expire_num_leases > 50000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High number of active leases"
          description: "{{ $labels.instance }} has {{ $value }} active leases"

      - alert: VaultRaftLeaderLost
        expr: vault_raft_leader_lastContact{quantile="0.99"} > 500
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Raft leader contact degraded"

      - alert: VaultAuditDeviceSlow
        expr: vault_audit_log_response{quantile="0.99"} > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Audit device I/O is slow"

Health Check Script

#!/bin/bash
# vault-health-check.sh -- quick cluster health assessment

VAULT_NODES=("vault-1.internal" "vault-2.internal" "vault-3.internal")
CA_CERT="/opt/vault/tls/ca.crt"

echo "=== Vault Cluster Health Check ==="
echo "Timestamp: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
echo ""

for node in "${VAULT_NODES[@]}"; do
  # Note: do not pass -k here; it would bypass the CA verification --cacert provides
  RESPONSE=$(curl -s --cacert "$CA_CERT" "https://${node}:8200/v1/sys/health" -w '\n%{http_code}' 2>/dev/null)
  HTTP_CODE=$(echo "$RESPONSE" | tail -1)
  BODY=$(echo "$RESPONSE" | head -1)

  case $HTTP_CODE in
    200) STATUS="ACTIVE (leader)" ;;
    429) STATUS="STANDBY" ;;
    472) STATUS="DR SECONDARY" ;;
    473) STATUS="PERFORMANCE STANDBY" ;;
    501) STATUS="UNINITIALIZED" ;;
    503) STATUS="SEALED" ;;
    *)   STATUS="UNREACHABLE (HTTP $HTTP_CODE)" ;;
  esac

  VERSION=$(echo "$BODY" | jq -r '.version // "unknown"' 2>/dev/null)
  echo "${node}: ${STATUS} (v${VERSION})"
done

echo ""
echo "=== Raft Peers ==="
vault operator raft list-peers 2>/dev/null || echo "Cannot list peers (not authenticated or not leader)"

echo ""
echo "=== Audit Devices ==="
vault audit list 2>/dev/null || echo "Cannot list audit devices (not authenticated)"

Backup and Restore

Raft Snapshots

Raft snapshots capture the entire Vault state, including all secrets, policies, auth configuration, and engine settings. Snapshots are encrypted with the barrier key, so they are safe to store in external systems, but you need a working Vault cluster (or the unseal keys) to restore them.

# Take a manual snapshot
vault operator raft snapshot save /backup/vault-snapshot-$(date +%Y%m%d-%H%M%S).snap

# Verify a snapshot (check its metadata)
vault operator raft snapshot inspect /backup/vault-snapshot-20260323-060000.snap

# Restore from a snapshot (WARNING: replaces all current data)
vault operator raft snapshot restore /backup/vault-snapshot-20260323-060000.snap

# Force restore (required if the cluster has different seal keys)
vault operator raft snapshot restore -force /backup/vault-snapshot-20260323-060000.snap

Automated Backup Script

#!/bin/bash
set -euo pipefail
# vault-backup.sh -- automated Vault backup

BACKUP_DIR="/backup/vault"
S3_BUCKET="s3://vault-backups/snapshots"
KMS_KEY_ID="arn:aws:kms:us-east-1:123456789012:key/backup-key-id"
RETENTION_DAYS=30
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
SNAPSHOT_FILE="${BACKUP_DIR}/vault-snapshot-${TIMESTAMP}.snap"

# Ensure backup directory exists
mkdir -p "$BACKUP_DIR"

# Take the snapshot
echo "$(date): Taking Vault snapshot..."
vault operator raft snapshot save "$SNAPSHOT_FILE"
SNAP_SIZE=$(stat --format="%s" "$SNAPSHOT_FILE")
echo "$(date): Snapshot saved (${SNAP_SIZE} bytes): ${SNAPSHOT_FILE}"

# Verify the snapshot
echo "$(date): Verifying snapshot..."
vault operator raft snapshot inspect "$SNAPSHOT_FILE" > /dev/null 2>&1
echo "$(date): Snapshot verification passed"

# Upload to S3 with server-side encryption
echo "$(date): Uploading to S3..."
aws s3 cp "$SNAPSHOT_FILE" \
  "${S3_BUCKET}/${TIMESTAMP}.snap" \
  --sse aws:kms \
  --sse-kms-key-id "$KMS_KEY_ID" \
  --metadata "vault-version=$(vault version -format=json | jq -r '.version'),node=$(hostname)"

# Create a "latest" pointer
aws s3 cp "$SNAPSHOT_FILE" \
  "${S3_BUCKET}/latest.snap" \
  --sse aws:kms \
  --sse-kms-key-id "$KMS_KEY_ID"

# Clean up old local snapshots
echo "$(date): Cleaning up snapshots older than ${RETENTION_DAYS} days..."
find "$BACKUP_DIR" -name "vault-snapshot-*.snap" -mtime +${RETENTION_DAYS} -delete

# Clean up old S3 snapshots (keep 30 days)
aws s3api list-objects-v2 --bucket vault-backups --prefix snapshots/ \
  --query "Contents[?LastModified<='$(date -u -d "${RETENTION_DAYS} days ago" +%Y-%m-%dT%H:%M:%SZ)'].Key" \
  --output text | tr '\t' '\n' | while read -r key; do
    # --output text prints "None" when the query matches nothing
    [ -n "$key" ] && [ "$key" != "None" ] && aws s3 rm "s3://vault-backups/$key"
  done

echo "$(date): Backup complete"

Schedule this with cron or a Kubernetes CronJob:

# Crontab entry: backup every 6 hours
0 */6 * * * /usr/local/bin/vault-backup.sh >> /var/log/vault/backup.log 2>&1
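
On Kubernetes, the equivalent is a CronJob wrapping the same script; a sketch (image name, namespace, and service account are placeholders):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: vault-backup
  namespace: vault
spec:
  schedule: "0 */6 * * *"          # every 6 hours, matching the crontab entry above
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: vault-backup   # placeholder; needs a Vault token and S3 access
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: registry.internal/vault-backup:latest   # placeholder image
              command: ["/usr/local/bin/vault-backup.sh"]
              env:
                - name: VAULT_ADDR
                  value: "https://vault.vault.svc:8200"
```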

Backup Strategy Checklist

  • Take snapshots every 4-6 hours at minimum (more frequently for high-change environments)
  • Store snapshots in at least two geographic locations
  • Encrypt snapshots at rest with a separate KMS key (not the Vault unseal key)
  • Test restores quarterly to a non-production cluster
  • Retain snapshots for at least 30 days (90 days for compliance-heavy environments)
  • Alert on backup failures within one backup cycle
  • Document the restore procedure and keep it accessible outside of Vault

Disaster Recovery

Recovery Scenarios

Scenario 1: Single node failure (most common)

With a three-node Raft cluster, losing one node does not affect availability. The remaining two nodes maintain quorum and elect a new leader if needed. Replace the failed node:

# If the failed node can be recovered, just restart it
sudo systemctl restart vault
# With auto-unseal, the node automatically unseals and rejoins

# If the node is permanently lost, remove it from the cluster
vault operator raft remove-peer vault-3

# Provision a new node with the same vault.hcl configuration
# (update node_id and api_addr)
# Start Vault -- it joins automatically via retry_join
sudo systemctl start vault
# With auto-unseal, it unseals and syncs data from the leader

# Verify the cluster
vault operator raft list-peers

Scenario 2: Quorum loss (majority of nodes lost)

If you lose two out of three nodes, the cluster cannot elect a leader and all operations stop. This requires a snapshot restore:

# 1. Stop all remaining Vault processes
sudo systemctl stop vault  # on all nodes

# 2. On the node that will become the new leader, clean the data directory
sudo rm -rf /opt/vault/data/*

# 3. Start Vault on that node only
sudo systemctl start vault

# 4. Initialize a new single-node cluster
vault operator init -key-shares=1 -key-threshold=1
# Or with auto-unseal, it initializes with recovery keys

# 5. Unseal the node
vault operator unseal UNSEAL_KEY

# 6. Restore the latest snapshot
vault operator raft snapshot restore -force /backup/latest.snap

# 7. The restore replaces all data including seal configuration
# You may need to unseal again after restore

# 8. Start the other nodes -- they join and sync via retry_join
sudo systemctl start vault  # on vault-2
sudo systemctl start vault  # on vault-3

# 9. Verify the cluster
vault operator raft list-peers
vault status

Scenario 3: Complete infrastructure loss

Everything is gone. Recover from off-site backup:

# 1. Pull the latest snapshot from S3
aws s3 cp s3://vault-backups/snapshots/latest.snap /tmp/vault-restore.snap

# 2. Provision new infrastructure (VMs, networking, TLS certs)
# 3. Install and configure Vault on the first node

# 4. Initialize and restore
vault operator init
vault operator unseal  # provide keys
vault operator raft snapshot restore -force /tmp/vault-restore.snap

# 5. Join additional nodes
# Start vault-2 and vault-3 -- retry_join handles the rest

# 6. Verify everything
vault operator raft list-peers
vault secrets list
vault auth list
vault audit list

Scenario 4: KMS key unavailable (auto-unseal failure)

If the cloud KMS key used for auto-unseal becomes unavailable:

# 1. Check KMS connectivity
# For AWS:
aws kms describe-key --key-id "arn:aws:kms:us-east-1:123456789012:key/abcd-1234"

# 2. If the KMS key is temporarily unavailable, wait for the cloud provider to resolve
# Vault will automatically unseal once KMS access is restored

# 3. If the KMS key is permanently lost (deleted, access revoked):
# You need the recovery keys to generate a new root token
vault operator generate-root -init
# Note the OTP and nonce printed by -init, then supply recovery keys:
vault operator generate-root -nonce="..." RECOVERY_KEY_1
vault operator generate-root -nonce="..." RECOVERY_KEY_2
vault operator generate-root -nonce="..." RECOVERY_KEY_3
# Decode the encoded token printed after the threshold is met:
vault operator generate-root -decode="ENCODED_TOKEN" -otp="OTP_FROM_INIT"

# 4. Migrate to a new KMS key:
# Update vault.hcl with the new KMS key
# Restart Vault with -migrate flag (similar to Shamir migration)

Disaster Recovery Replication (Enterprise)

Vault Enterprise supports DR replication, where a secondary cluster receives a continuous stream of data from the primary:

# On the primary cluster
vault write -f sys/replication/dr/primary/enable

# Generate a secondary activation token
vault write sys/replication/dr/primary/secondary-token id="dr-secondary"

# On the secondary cluster
vault write sys/replication/dr/secondary/enable token="SECONDARY_TOKEN_HERE"

# Verify replication status
vault read sys/replication/dr/status

The DR secondary is a hot standby. In a disaster, promote it to primary:

# Generate a DR operation token using recovery keys
vault operator generate-root -dr-token -init
# Supply recovery keys with the nonce from the -init output:
vault operator generate-root -dr-token -nonce="..." RECOVERY_KEY_1
vault operator generate-root -dr-token -nonce="..." RECOVERY_KEY_2
vault operator generate-root -dr-token -nonce="..." RECOVERY_KEY_3

# Promote the secondary
vault write sys/replication/dr/secondary/promote dr_operation_token="DR_TOKEN_HERE"

Security Hardening Checklist

TLS Configuration

listener "tcp" {
  address       = "0.0.0.0:8200"
  tls_cert_file = "/opt/vault/tls/vault-full-chain.crt"
  tls_key_file  = "/opt/vault/tls/vault.key"
  tls_min_version = "tls13"

  # Recommended cipher suites for TLS 1.2 (if TLS 1.3 is not universally supported)
  # tls_cipher_suites = "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384"

  # Enable mTLS for client verification
  tls_require_and_verify_client_cert = false  # Set to true for mTLS
  tls_client_ca_file = "/opt/vault/tls/client-ca.crt"

  # Disable HTTP/2 if not needed (reduces attack surface)
  # http2_enable = false
}

Network Policies (Kubernetes)

# vault-network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: vault-server
  namespace: vault
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: vault
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # Allow API traffic from application namespaces
    - from:
        - namespaceSelector:
            matchLabels:
              vault-access: "true"
      ports:
        - port: 8200
          protocol: TCP
    # Allow Raft cluster communication between Vault nodes
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: vault
      ports:
        - port: 8201
          protocol: TCP
    # Allow Prometheus scraping
    - from:
        - namespaceSelector:
            matchLabels:
              name: monitoring
      ports:
        - port: 8200
          protocol: TCP
  egress:
    # Allow Raft cluster communication
    - to:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: vault
      ports:
        - port: 8201
          protocol: TCP
    # Allow DNS resolution
    - to: []
      ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP
    # Allow KMS for auto-unseal
    - to: []
      ports:
        - port: 443
          protocol: TCP
    # Allow database connections for dynamic secrets
    - to: []
      ports:
        - port: 5432
          protocol: TCP
        - port: 3306
          protocol: TCP

Linux System Hardening

# Create the vault user with minimal permissions
sudo useradd --system --home /opt/vault --shell /usr/sbin/nologin vault

# Set file permissions
sudo chown -R vault:vault /opt/vault
sudo chmod 700 /opt/vault/data
sudo chmod 600 /opt/vault/tls/vault.key
sudo chmod 644 /opt/vault/tls/vault.crt

# Enable memory locking (prevent secrets from being swapped to disk)
sudo setcap cap_ipc_lock=+ep /usr/local/bin/vault

# Restrict SSH access to Vault nodes
# Use bastion host or VPN-only access

# Enable kernel hardening
echo "kernel.dmesg_restrict = 1" | sudo tee -a /etc/sysctl.d/vault.conf
echo "kernel.kptr_restrict = 2" | sudo tee -a /etc/sysctl.d/vault.conf
echo "net.ipv4.conf.all.send_redirects = 0" | sudo tee -a /etc/sysctl.d/vault.conf
sudo sysctl --system
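A quick way to sanity-check the permission scheme above without touching the live /opt/vault tree is to reproduce it on a scratch directory and read the modes back (sketch; `stat -c` assumes GNU coreutils):

```shell
# Recreate the layout on a scratch directory and verify the octal modes
dir=$(mktemp -d)
mkdir -p "$dir/data" "$dir/tls"
touch "$dir/tls/vault.key" "$dir/tls/vault.crt"
chmod 700 "$dir/data"           # only the owner may enter the data dir
chmod 600 "$dir/tls/vault.key"  # private key readable by owner only
chmod 644 "$dir/tls/vault.crt"  # certificate is public material
modes=$(stat -c '%a %n' "$dir/data" "$dir/tls/vault.key" "$dir/tls/vault.crt")
echo "$modes"
rm -rf "$dir"
```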

Least-Privilege Practices

Create an admin policy that can manage Vault without having root access:

vault policy write vault-admin - <<'EOF'
# Manage secrets engines
path "sys/mounts/*" {
  capabilities = ["create", "read", "update", "delete", "list"]
}
path "sys/mounts" {
  capabilities = ["read", "list"]
}

# Manage policies
path "sys/policies/acl/*" {
  capabilities = ["create", "read", "update", "delete", "list"]
}
path "sys/policies/acl" {
  capabilities = ["list"]
}

# Manage auth methods
path "sys/auth/*" {
  capabilities = ["create", "read", "update", "delete", "list", "sudo"]
}
path "sys/auth" {
  capabilities = ["read", "list"]
}

# Manage audit devices
path "sys/audit/*" {
  capabilities = ["create", "read", "update", "delete", "list", "sudo"]
}
path "sys/audit" {
  capabilities = ["read", "list"]
}

# View system health and leader status
path "sys/health" {
  capabilities = ["read"]
}
path "sys/leader" {
  capabilities = ["read"]
}

# Manage leases
path "sys/leases/*" {
  capabilities = ["create", "read", "update", "delete", "list"]
}

# Read metrics
path "sys/metrics" {
  capabilities = ["read"]
}

# Raft operations
path "sys/storage/raft/*" {
  capabilities = ["read", "list"]
}

# DENY direct access to application secrets
# Admins manage infrastructure, not application data
path "secret/*" {
  capabilities = ["deny"]
}
path "database/*" {
  capabilities = ["deny"]
}
EOF

# Revoke the root token after creating admin access
vault token revoke ROOT_TOKEN_HERE

The root token should be used only for initial setup. After creating admin policies and users, revoke it. If you need root access later, generate a new root token using a threshold of recovery keys (or unseal keys, if you still use Shamir):

# Generate a new root token (requires a threshold of recovery/unseal keys)
vault operator generate-root -init
# Note the OTP that -init prints, then supply each key when prompted:
vault operator generate-root
# Once the threshold is met, decode the encoded token it returns:
vault operator generate-root -decode=ENCODED_TOKEN -otp=OTP_FROM_INIT

Operational Runbooks

Runbook: Sealed Node

# 1. Check seal status
vault status

# 2. If using auto-unseal, check KMS connectivity
# AWS:
aws kms describe-key --key-id "YOUR_KMS_KEY_ARN"
# Azure:
az keyvault key show --vault-name "vault-unseal" --name "unseal-key"
# GCP:
gcloud kms keys describe vault-unseal-key --location global --keyring vault-keyring

# 3. Check Vault logs for seal-related errors
journalctl -u vault -n 100 --no-pager | grep -i "seal\|unseal\|kms"

# 4. If KMS is healthy, restart the Vault service
sudo systemctl restart vault

# 5. Monitor for automatic unseal (wait 30 seconds)
sleep 30 && vault status

# 6. If auto-unseal fails, check IAM permissions
# Verify the Vault process has access to the KMS key

# 7. Last resort: manual unseal with recovery keys (if migrated from Shamir)
# This should not be needed with auto-unseal unless KMS is permanently unavailable

Runbook: Leader Election Failure

# 1. Check Raft cluster status
vault operator raft list-peers

# 2. Check if quorum is maintained (need majority of nodes)
# 3-node cluster needs 2 nodes
# 5-node cluster needs 3 nodes

# 3. If a node is unreachable, check its status
curl -sk https://vault-3.internal:8200/v1/sys/health

# 4. If the node is permanently down, remove it from the cluster
vault operator raft remove-peer vault-3

# 5. If the cluster lost quorum entirely:
# Follow Scenario 2 from the Disaster Recovery section

# 6. Verify cluster health after resolution
vault operator raft list-peers
vault operator raft autopilot state
vault status
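The quorum arithmetic in step 2 generalizes to any cluster size: an n-node Raft cluster needs floor(n/2) + 1 voters, and tolerates the loss of whatever is left over. A small helper for sizing discussions:

```shell
# Quorum for an n-node Raft cluster is floor(n/2) + 1;
# fault tolerance is the remainder.
quorum()    { echo $(( $1 / 2 + 1 )); }
tolerance() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 3 5 7; do
  echo "nodes=$n quorum=$(quorum $n) can_lose=$(tolerance $n)"
done
# nodes=3 quorum=2 can_lose=1
# nodes=5 quorum=3 can_lose=2
# nodes=7 quorum=4 can_lose=3
```

This is also why even-sized clusters buy nothing: four nodes need three for quorum, so they tolerate only one failure, the same as three nodes.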

Runbook: High Latency

# 1. Check system-level metrics
top -bn1 | head -20
iostat -x 1 3
free -h

# 2. Check Vault metrics
curl -s --header "X-Vault-Token: $VAULT_TOKEN" \
  "${VAULT_ADDR}/v1/sys/metrics?format=json" | \
  jq '.Gauges[] | select(.Name | contains("runtime"))'

# 3. Check active lease count (high lease count causes performance degradation)
vault read -format=json sys/metrics | \
  jq '.data.Gauges[] | select(.Name == "vault.expire.num_leases")'

# 4. If lease count is high, identify and revoke unnecessary leases
vault list -format=json sys/leases/lookup/database/ 2>/dev/null
vault lease revoke -prefix database/creds/unused-role/

# 5. Check audit device performance (slow disk causes request delays)
ls -la /var/log/vault/audit.log
df -h /var/log/
iostat -x 1 1 | grep -E "Device|$(findmnt -n -o SOURCE /var/log | xargs basename)"

# 6. If audit log file is too large, rotate it
sudo logrotate -f /etc/logrotate.d/vault

# 7. Check Raft commit times (indicates storage backend performance)
curl -s --header "X-Vault-Token: $VAULT_TOKEN" \
  "${VAULT_ADDR}/v1/sys/metrics?format=json" | \
  jq '.Summaries[] | select(.Name | contains("raft.commitTime"))'
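Step 6 assumes a logrotate policy is already in place. A minimal sketch of what /etc/logrotate.d/vault might contain (the SIGHUP in postrotate is what makes Vault reopen its audit log file handles after rotation):

```
/var/log/vault/audit.log {
    rotate 14
    daily
    compress
    missingok
    notifempty
    postrotate
        systemctl kill -s HUP vault.service
    endscript
}
```

Never use copytruncate-style rotation with an audit device; losing audit writes causes Vault to block requests, so reopening the file via SIGHUP is the safe path.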

Runbook: Emergency Seal

If you suspect a breach or unauthorized access:

# 1. Seal Vault immediately -- this stops ALL secret access
vault operator seal

# 2. This is a deliberate outage -- inform your team immediately

# 3. Investigate the audit logs -- read the log files directly, since the API is unavailable while sealed
# Copy audit log to investigation workstation
cp /var/log/vault/audit.log /tmp/investigation/

# 4. Search for suspicious activity
cat /tmp/investigation/audit.log | \
  jq 'select(.auth.display_name == "suspicious-identity")' > suspicious.json

cat /tmp/investigation/audit.log | \
  jq 'select(.request.remote_address | startswith("unknown-range"))' >> suspicious.json

# 5. Identify compromised tokens and prepare revocation commands
cat suspicious.json | jq -r '.auth.accessor' | sort -u > compromised-accessors.txt

# 6. Unseal when investigation is complete and remediation is planned
vault operator unseal UNSEAL_KEY_1
vault operator unseal UNSEAL_KEY_2
vault operator unseal UNSEAL_KEY_3

# 7. Revoke compromised tokens
while read accessor; do
  vault token revoke -accessor "$accessor"
done < compromised-accessors.txt

# 8. Revoke leases associated with compromised identities
vault lease revoke -prefix COMPROMISED_PATH/

# 9. Rotate any secrets that may have been exposed
vault write -f database/rotate-root/myapp-db

# 10. Document the incident and update runbooks
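The jq filters in steps 4 and 5 assume jq is installed on the investigation host. When it is not, the accessor extraction can fall back to plain grep, since audit entries are one JSON object per line. A self-contained sketch over fabricated sample entries (real logs contain HMACed values):

```shell
# Fabricated audit entries standing in for /tmp/investigation/suspicious.json
sample=$(mktemp)
cat > "$sample" <<'EOF'
{"auth":{"accessor":"hmac-sha256:aaa","display_name":"suspicious-identity"}}
{"auth":{"accessor":"hmac-sha256:bbb","display_name":"suspicious-identity"}}
{"auth":{"accessor":"hmac-sha256:aaa","display_name":"suspicious-identity"}}
EOF
# Pull unique accessors without jq: match the "accessor" field, keep the value
accessors=$(grep -o '"accessor":"[^"]*"' "$sample" | cut -d'"' -f4 | sort -u)
echo "$accessors"
rm -f "$sample"
```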

Runbook: Vault Upgrade

# 1. Read the changelog and upgrade notes for the target version
# 2. Test the upgrade in a non-production environment first

# 3. Take a snapshot before upgrading
vault operator raft snapshot save /backup/pre-upgrade-$(date +%Y%m%d).snap

# 4. Upgrade standby nodes first
# On vault-2 (standby):
sudo systemctl stop vault
sudo cp /usr/local/bin/vault /usr/local/bin/vault.bak
sudo cp /tmp/vault-new-version /usr/local/bin/vault
sudo systemctl start vault

# Verify the standby node is healthy
curl -sk https://vault-2.internal:8200/v1/sys/health

# 5. Repeat for vault-3 (standby)

# 6. Step down the leader to trigger failover to an upgraded node
vault operator step-down

# 7. Upgrade the old leader (now a standby)
# On vault-1:
sudo systemctl stop vault
sudo cp /usr/local/bin/vault /usr/local/bin/vault.bak
sudo cp /tmp/vault-new-version /usr/local/bin/vault
sudo systemctl start vault

# 8. Verify all nodes are running the new version
for node in vault-{1,2,3}.internal; do
  echo "${node}: $(curl -sk https://${node}:8200/v1/sys/health | jq -r '.version')"
done

# 9. Verify cluster health
vault operator raft list-peers
vault status

Summary

Running Vault in production requires planning across several dimensions: storage (Raft integrated storage for simplicity and performance), availability (a three- or five-node cluster for quorum tolerance), unsealing (cloud KMS auto-unseal for zero-touch recovery), observability (dual audit devices plus Prometheus telemetry), backup (automated Raft snapshots stored off-site), and security (TLS 1.3, network policies, least-privilege policies, revoked root tokens).

Treat your Vault cluster as the most critical piece of infrastructure you operate, because if Vault goes down, every system that depends on it for credentials will follow. Build the automation, monitoring, and runbooks before you need them; the middle of an incident is the wrong time to figure out your recovery procedure. Start with a three-node Raft cluster, enable auto-unseal with cloud KMS, configure at least two audit devices, automate snapshots to run every six hours, deploy Prometheus alerting on key metrics, and keep your runbooks in a location that does not require Vault access to read them.
