Terraform Remote State: S3 Backends, Locking, Workspaces, and State Surgery
State Is the Source of Truth. Treat It That Way.
Your Terraform state file is the single most critical artifact in your infrastructure pipeline. It maps every resource Terraform manages to real cloud objects. Lose it, corrupt it, or let two engineers write to it simultaneously — and you're in for a very bad day.
Local state is a toy. If you're running terraform apply with state sitting on your laptop, you're one rm -rf away from orphaned infrastructure nobody can manage. Let's fix that.
Setting Up the S3 Backend
First, you need the backend infrastructure itself. Yes, this is the chicken-and-egg problem of IaC — you need infrastructure to store the state that manages your infrastructure.
Bootstrap Module
bootstrap/
├── main.tf
├── variables.tf
├── outputs.tf
└── terraform.tfvars
# bootstrap/main.tf
resource "aws_s3_bucket" "state" {
  bucket = "${var.org_name}-terraform-state"

  tags = {
    ManagedBy = "terraform-bootstrap"
    Purpose   = "terraform-state"
  }
}

resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "state" {
  bucket = aws_s3_bucket.state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.state.arn
    }
    bucket_key_enabled = true
  }
}

resource "aws_s3_bucket_public_access_block" "state" {
  bucket = aws_s3_bucket.state.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_kms_key" "state" {
  description             = "KMS key for Terraform state encryption"
  deletion_window_in_days = 30
  enable_key_rotation     = true
}

# Alias so backend configs can reference the key without hardcoding its ARN
resource "aws_kms_alias" "state" {
  name          = "alias/terraform-state"
  target_key_id = aws_kms_key.state.key_id
}

resource "aws_dynamodb_table" "locks" {
  name         = "${var.org_name}-terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }

  tags = {
    ManagedBy = "terraform-bootstrap"
    Purpose   = "terraform-state-locking"
  }
}
KMS encryption, versioning, public access blocked, and DynamoDB for locking. This is the minimum. Apply this with local state, then migrate.
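One optional hardening step, sketched here as a suggestion rather than a requirement: add a lifecycle guard so the state bucket cannot be destroyed through Terraform itself.

```hcl
# Sketch: prevent_destroy makes `terraform destroy` (or any plan that would
# delete this resource) fail loudly instead of deleting your state home.
# Merge the lifecycle block into the aws_s3_bucket.state resource above,
# and the DynamoDB table too if you like.
resource "aws_s3_bucket" "state" {
  bucket = "${var.org_name}-terraform-state"

  lifecycle {
    prevent_destroy = true
  }
}
```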
Configuring the Backend
# backend.tf
terraform {
  backend "s3" {
    bucket         = "acme-terraform-state"
    key            = "networking/vpc/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    kms_key_id     = "alias/terraform-state"
    dynamodb_table = "acme-terraform-locks"
  }
}
After adding this, run:
terraform init -migrate-state
Terraform copies your local state to S3. Verify it worked: run terraform plan and confirm it reports no changes. Only then delete the local .tfstate file. Don't skip verification.
State Locking: Why DynamoDB Matters
Without locking, this happens:
1. Engineer A runs terraform plan — sees 3 changes
2. Engineer B runs terraform plan — sees the same 3 changes
3. Both run terraform apply at the same time
4. One apply succeeds; the other corrupts state or creates duplicate resources
DynamoDB locking prevents concurrent writes. When Terraform starts an operation, it writes a lock record to the DynamoDB table. Any other run that needs the lock fails immediately with a lock error (or waits, if you pass -lock-timeout) until the lock is released.
# Lock stuck after a crashed apply?
terraform force-unlock LOCK_ID
# Get the lock ID from the error message. ALWAYS investigate why
# the lock was stuck before force-unlocking.
State Key Strategy
Your key path in the backend config determines how state files are organized in S3. Here's the pattern I use:
s3://acme-terraform-state/
├── networking/
│ ├── vpc/terraform.tfstate
│ └── dns/terraform.tfstate
├── compute/
│ ├── eks/terraform.tfstate
│ └── ec2-bastion/terraform.tfstate
├── data/
│ ├── rds-primary/terraform.tfstate
│ └── elasticache/terraform.tfstate
└── security/
├── iam/terraform.tfstate
└── waf/terraform.tfstate
One state file per logical component. Small blast radius. If an apply goes wrong on your WAF config, your VPC state is untouched.
Workspaces: When They Work and When They Don't
Workspaces create isolated state files within the same backend config. Terraform stores them under env:/ prefixes in S3.
terraform workspace new staging
terraform workspace new prod
terraform workspace select staging
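With the backend from earlier, non-default workspaces store their state under an env:/ prefix, while the default workspace keeps the plain key (layout illustrative):

```
s3://acme-terraform-state/
├── networking/vpc/terraform.tfstate                  # default workspace
└── env:/
    ├── staging/networking/vpc/terraform.tfstate
    └── prod/networking/vpc/terraform.tfstate
```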
# Using the workspace name in resource configuration
locals {
  env = terraform.workspace

  instance_type = {
    dev     = "t3.small"
    staging = "t3.medium"
    prod    = "m5.large"
  }
}

resource "aws_instance" "app" {
  instance_type = local.instance_type[local.env]

  tags = {
    Environment = local.env
  }
}
When workspaces work
- Same infrastructure, different sizes per environment
- Small teams where everyone understands the workspace model
- Ephemeral environments for feature branches
When workspaces fail
- Different environments need fundamentally different resources
- Teams larger than ~10 engineers (workspace confusion is real)
- When you need different backend configs per environment
For most production setups, I prefer directory-based separation over workspaces:
environments/
├── dev/
│ ├── backend.tf # key = "dev/app/terraform.tfstate"
│ ├── main.tf
│ └── terraform.tfvars
├── staging/
│ ├── backend.tf # key = "staging/app/terraform.tfstate"
│ ├── main.tf
│ └── terraform.tfvars
└── prod/
├── backend.tf # key = "prod/app/terraform.tfstate"
├── main.tf
└── terraform.tfvars
Explicit. Visible. No hidden terraform.workspace magic.
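To keep the three directories from drifting apart, each one typically just wires environment-specific values into a shared module. A sketch (the module path and variable names are hypothetical):

```hcl
# environments/prod/main.tf (hypothetical shared-module call)
module "app" {
  source = "../../modules/app" # shared module, assumed to exist

  environment   = "prod"
  instance_type = "m5.large"
}
```

The environment directories stay small enough to diff by eye; all real logic lives in the shared module.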
State Surgery: The Emergency Toolkit
Sometimes state gets out of sync with reality. These commands are your scalpel.
# View everything in state
terraform state list
# See details of a specific resource
terraform state show aws_s3_bucket.data
# Remove a resource from state WITHOUT destroying it
# Use this when you want Terraform to "forget" a resource
terraform state rm aws_s3_bucket.legacy
# Move a resource to a new address (after refactoring)
terraform state mv aws_instance.old aws_instance.new
# Move a resource into a module
terraform state mv aws_vpc.main module.networking.aws_vpc.this
# Import an existing resource into state
terraform import aws_s3_bucket.existing my-bucket-name
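Since Terraform 1.5, imports can also be declared in configuration, which makes them plannable and reviewable like moved blocks. The addresses here mirror the CLI example above:

```hcl
# Declarative equivalent of the `terraform import` command (Terraform 1.5+)
import {
  to = aws_s3_bucket.existing
  id = "my-bucket-name"
}
```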
The moved Block (Terraform 1.1+)
Instead of manual state mv commands, declare moves in code:
moved {
  from = aws_instance.app
  to   = module.compute.aws_instance.app
}
This is refactoring as code. It goes through plan/apply, it's reviewable in a PR, and it's self-documenting. Always prefer moved blocks over manual state surgery.
Recovering from Disaster
S3 versioning is your safety net. If state gets corrupted:
# List state file versions
aws s3api list-object-versions \
  --bucket acme-terraform-state \
  --prefix networking/vpc/terraform.tfstate

# Download a previous version
aws s3api get-object \
  --bucket acme-terraform-state \
  --key networking/vpc/terraform.tfstate \
  --version-id "abc123" \
  recovered.tfstate

# Push the recovered state. Terraform refuses to push a state whose
# serial is lower than the remote's; add -force only after review.
terraform state push recovered.tfstate
This is why versioning on the state bucket is non-negotiable.
CI/CD Pipeline for State Operations
Never run terraform apply from a laptop in production. Use a CI pipeline with proper access controls.
# .github/workflows/terraform.yml
name: Terraform

on:
  push:
    branches: [main]
    paths: ['infrastructure/**']
  pull_request:
    paths: ['infrastructure/**']

permissions:
  contents: read
  id-token: write
  pull-requests: write

jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.8.0"
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/terraform-plan
          aws-region: us-east-1
      - name: Terraform Init
        run: terraform init
        working-directory: infrastructure/networking
      - name: Terraform Plan
        id: plan
        run: terraform plan -no-color -out=tfplan
        working-directory: infrastructure/networking
      - name: Comment PR with Plan
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const output = `#### Terraform Plan
            \`\`\`
            ${{ steps.plan.outputs.stdout }}
            \`\`\`
            *Pushed by: @${{ github.actor }}*`;
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: output
            })

  apply:
    needs: plan
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/terraform-apply
          aws-region: us-east-1
      - run: terraform init && terraform apply -auto-approve
        working-directory: infrastructure/networking
Two IAM roles: terraform-plan has read-only access, terraform-apply has write access. The plan role is used for PRs. The apply role is locked behind a GitHub environment with required reviewers.
State File Security
Your state file contains sensitive data — database passwords, API keys, resource ARNs. Treat it accordingly.
IAM Policy for State Access
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowStateBucketAccess",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::acme-terraform-state/${aws:PrincipalTag/Team}/*"
    },
    {
      "Sid": "AllowLockTable",
      "Effect": "Allow",
      "Action": [
        "dynamodb:PutItem",
        "dynamodb:GetItem",
        "dynamodb:DeleteItem"
      ],
      "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/acme-terraform-locks"
    }
  ]
}
Tag-based access control: the networking team can only access state files under the networking/ prefix. The payments team can only access payments/. No one accidentally destroys another team's infrastructure.
Detecting State Drift
State drift happens when someone modifies infrastructure outside of Terraform. Detect it early.
#!/bin/bash
# drift-detection.sh — run on a schedule
set -u

MODULES=("networking/vpc" "compute/eks" "data/rds-primary")

for module in "${MODULES[@]}"; do
  echo "Checking drift for: $module"
  pushd "infrastructure/$module" > /dev/null

  terraform init -input=false > /dev/null 2>&1
  terraform plan -detailed-exitcode -no-color > /dev/null 2>&1
  EXIT_CODE=$?

  if [ $EXIT_CODE -eq 2 ]; then
    echo "DRIFT DETECTED in $module"
    # Send alert
    curl -X POST "$SLACK_WEBHOOK" \
      -H 'Content-Type: application/json' \
      -d "{\"text\":\"Terraform drift detected in \`$module\`. Run \`terraform plan\` to review.\"}"
  elif [ $EXIT_CODE -eq 0 ]; then
    echo "No drift in $module"
  else
    echo "ERROR checking $module"
  fi

  popd > /dev/null
done
Schedule this daily. terraform plan -detailed-exitcode returns exit code 2 when there are changes, making it scriptable. Catching drift early prevents the "someone changed this in the console and now my plan shows 47 changes" nightmare.
Common Pitfalls
Pitfall 1: Storing sensitive outputs in state. Terraform stores all outputs in state as plaintext. If you output a database password, it's readable by anyone with state access. Use sensitive = true on outputs to prevent them from showing in logs, but know they're still in the state file.
output "db_password" {
  value     = random_password.db.result
  sensitive = true
}
Pitfall 2: Reaching for terraform state mv and state rm when declarative blocks would do. Manual state operations are one-shot and unauditable. moved blocks (Terraform 1.1+) and removed blocks (Terraform 1.7+) go through plan, get code review, and document themselves. Prefer them.
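A sketch of the removed block, the declarative counterpart to terraform state rm (the resource address is illustrative):

```hcl
# Make Terraform forget the bucket without destroying it (Terraform 1.7+)
removed {
  from = aws_s3_bucket.legacy

  lifecycle {
    destroy = false
  }
}
```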
Pitfall 3: Migrating state without verifying. After terraform init -migrate-state, always run terraform plan to confirm zero changes. If the plan shows changes, the migration went wrong.
Pitfall 4: Sharing state across modules. One module's state file should never be writable by another module's pipeline. Use terraform_remote_state data sources for read-only cross-module references.
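A sketch of such a read-only cross-module reference, assuming the VPC component exposes an output named vpc_id:

```hcl
# compute/eks reading networking/vpc outputs, read-only
data "terraform_remote_state" "vpc" {
  backend = "s3"

  config = {
    bucket = "acme-terraform-state"
    key    = "networking/vpc/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_security_group" "app" {
  name   = "app"
  vpc_id = data.terraform_remote_state.vpc.outputs.vpc_id
}
```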
Conclusion
Remote state is not optional — it's the foundation of collaborative IaC. Set up S3 with KMS encryption, enable versioning, add DynamoDB locking, and organize your state keys by domain. Use workspaces only when they genuinely simplify your setup, and keep moved blocks and state mv in your back pocket for when refactoring day comes. Run drift detection on a schedule, lock state access with IAM policies, and run terraform apply only from CI. Your state file is your infrastructure's memory. Protect it like production data, because that's exactly what it is.