DevOpsil

Terraform Remote State: S3 Backends, Locking, Workspaces, and State Surgery

Zara Blackwood · 9 min read

State Is the Source of Truth. Treat It That Way.

Your Terraform state file is the single most critical artifact in your infrastructure pipeline. It maps every resource Terraform manages to real cloud objects. Lose it, corrupt it, or let two engineers write to it simultaneously — and you're in for a very bad day.

Local state is a toy. If you're running terraform apply with state sitting on your laptop, you're one rm -rf away from orphaned infrastructure nobody can manage. Let's fix that.

Setting Up the S3 Backend

First, you need the backend infrastructure itself. Yes, this is the chicken-and-egg problem of IaC — you need infrastructure to store the state that manages your infrastructure.

Bootstrap Module

bootstrap/
├── main.tf
├── variables.tf
├── outputs.tf
└── terraform.tfvars

# bootstrap/main.tf

resource "aws_s3_bucket" "state" {
  bucket = "${var.org_name}-terraform-state"

  tags = {
    ManagedBy = "terraform-bootstrap"
    Purpose   = "terraform-state"
  }
}

resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "state" {
  bucket = aws_s3_bucket.state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.state.arn
    }
    bucket_key_enabled = true
  }
}

resource "aws_s3_bucket_public_access_block" "state" {
  bucket = aws_s3_bucket.state.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_kms_key" "state" {
  description             = "KMS key for Terraform state encryption"
  deletion_window_in_days = 30
  enable_key_rotation     = true
}

resource "aws_dynamodb_table" "locks" {
  name         = "${var.org_name}-terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }

  tags = {
    ManagedBy = "terraform-bootstrap"
    Purpose   = "terraform-state-locking"
  }
}

KMS encryption, versioning, public access blocked, and DynamoDB for locking. This is the minimum. Apply this with local state, then migrate.

Configuring the Backend

# backend.tf
terraform {
  backend "s3" {
    bucket         = "acme-terraform-state"
    key            = "networking/vpc/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    kms_key_id     = "alias/terraform-state"
    dynamodb_table = "acme-terraform-locks"
  }
}

After adding this, run:

terraform init -migrate-state

Terraform copies your local state to S3. Verify it worked, then delete the local .tfstate file. Don't skip verification.
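The verification step is easy to script. A minimal sketch: `terraform plan -detailed-exitcode` maps the result onto an exit code (0 = no changes, 1 = error, 2 = changes pending), so a clean migration is machine-checkable. The stub function at the bottom stands in for the real binary so the logic is runnable anywhere — delete it to check a real workspace.

```shell
#!/usr/bin/env bash
# Post-migration check: a clean migration produces a zero-change plan.
verify_migration() {
  terraform plan -detailed-exitcode -input=false > /dev/null 2>&1
  case $? in
    0) echo "Migration clean: plan shows zero changes." ;;
    2) echo "Plan shows changes — do NOT delete local state yet."; return 1 ;;
    *) echo "Plan failed — investigate before touching local state."; return 1 ;;
  esac
}

# Stub standing in for the real CLI so this sketch runs anywhere;
# remove it to verify an actual workspace.
terraform() { return 0; }

verify_migration
```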

State Locking: Why DynamoDB Matters

Without locking, this happens:

  1. Engineer A runs terraform plan — sees 3 changes
  2. Engineer B runs terraform plan — sees the same 3 changes
  3. Both run terraform apply at the same time
  4. One apply succeeds, the other corrupts state or creates duplicate resources

DynamoDB locking prevents concurrent writes. When Terraform acquires a lock, it writes a record to the DynamoDB table. Any other apply attempt blocks until the lock is released.

# Lock stuck after a crashed apply?
terraform force-unlock LOCK_ID

# Get the lock ID from the error message. ALWAYS investigate why
# the lock was stuck before force-unlocking.
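Part of that investigation is looking at who holds the lock. A sketch, assuming the S3 backend's usual convention of LockID = "<bucket>/<key>" and that the lock metadata (who, what operation, when) lives in the item's Info attribute — verify both against your own table:

```shell
# Read the lock record for one state file before force-unlocking.
# The LockID format below is an assumption — confirm it in your table.
aws dynamodb get-item \
  --table-name acme-terraform-locks \
  --key '{"LockID": {"S": "acme-terraform-state/networking/vpc/terraform.tfstate"}}' \
  --query 'Item.Info.S' \
  --output text
```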

State Key Strategy

Your key path in the backend config determines how state files are organized in S3. Here's the pattern I use:

s3://acme-terraform-state/
├── networking/
│   ├── vpc/terraform.tfstate
│   └── dns/terraform.tfstate
├── compute/
│   ├── eks/terraform.tfstate
│   └── ec2-bastion/terraform.tfstate
├── data/
│   ├── rds-primary/terraform.tfstate
│   └── elasticache/terraform.tfstate
└── security/
    ├── iam/terraform.tfstate
    └── waf/terraform.tfstate

One state file per logical component. Small blast radius. If an apply goes wrong on your WAF config, your VPC state is untouched.

Workspaces: When They Work and When They Don't

Workspaces create isolated state files within the same backend config. Terraform stores them under env:/ prefixes in S3.

terraform workspace new staging
terraform workspace new prod
terraform workspace select staging

# Using workspace name in resource configuration
locals {
  env = terraform.workspace

  instance_type = {
    dev     = "t3.small"
    staging = "t3.medium"
    prod    = "m5.large"
  }
}

resource "aws_instance" "app" {
  instance_type = local.instance_type[local.env]

  tags = {
    Environment = local.env
  }
}

When workspaces work

  • Same infrastructure, different sizes per environment
  • Small teams where everyone understands the workspace model
  • Ephemeral environments for feature branches
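That last case is easy to script in CI. A sketch, assuming Terraform >= 1.4 (for `select -or-create`) and a branch identifier injected by the pipeline:

```shell
# Spin up an isolated copy of the stack for a feature branch.
BRANCH="pr-142"   # hypothetical value; CI would inject this

terraform workspace select -or-create "$BRANCH"
terraform apply -auto-approve

# Teardown when the branch merges — a workspace must be empty and
# deselected before it can be deleted.
terraform destroy -auto-approve
terraform workspace select default
terraform workspace delete "$BRANCH"
```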

When workspaces fail

  • Different environments need fundamentally different resources
  • Teams larger than ~10 engineers (workspace confusion is real)
  • When you need different backend configs per environment

For most production setups, I prefer directory-based separation over workspaces:

environments/
├── dev/
│   ├── backend.tf    # key = "dev/app/terraform.tfstate"
│   ├── main.tf
│   └── terraform.tfvars
├── staging/
│   ├── backend.tf    # key = "staging/app/terraform.tfstate"
│   ├── main.tf
│   └── terraform.tfvars
└── prod/
    ├── backend.tf    # key = "prod/app/terraform.tfstate"
    ├── main.tf
    └── terraform.tfvars

Explicit. Visible. No hidden terraform.workspace magic.
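Targeting an environment then becomes an explicit cd instead of a workspace switch. A sketch of how CI might fan out over the directories above (paths are assumptions from that layout):

```shell
# Each environment is just a directory with its own backend config;
# nothing depends on hidden workspace state.
for env in dev staging prod; do
  ( cd "environments/$env" \
      && terraform init -input=false \
      && terraform plan -out=tfplan )
done
```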

State Surgery: The Emergency Toolkit

Sometimes state gets out of sync with reality. These commands are your scalpel.

# View everything in state
terraform state list

# See details of a specific resource
terraform state show aws_s3_bucket.data

# Remove a resource from state WITHOUT destroying it
# Use this when you want Terraform to "forget" a resource
terraform state rm aws_s3_bucket.legacy

# Move a resource to a new address (after refactoring)
terraform state mv aws_instance.old aws_instance.new

# Move a resource into a module
terraform state mv aws_vpc.main module.networking.aws_vpc.this

# Import an existing resource into state
terraform import aws_s3_bucket.existing my-bucket-name

The moved Block (Terraform 1.1+)

Instead of manual state mv commands, declare moves in code:

moved {
  from = aws_instance.app
  to   = module.compute.aws_instance.app
}

This is refactoring as code. It goes through plan/apply, it's reviewable in a PR, and it's self-documenting. Always prefer moved blocks over manual state surgery.

Recovering from Disaster

S3 versioning is your safety net. If state gets corrupted:

# List state file versions
aws s3api list-object-versions \
  --bucket acme-terraform-state \
  --prefix networking/vpc/terraform.tfstate

# Download a previous version
aws s3api get-object \
  --bucket acme-terraform-state \
  --key networking/vpc/terraform.tfstate \
  --version-id "abc123" \
  recovered.tfstate

# Push the recovered state
terraform state push recovered.tfstate

This is why versioning on the state bucket is non-negotiable.

CI/CD Pipeline for State Operations

Never run terraform apply from a laptop in production. Use a CI pipeline with proper access controls.

# .github/workflows/terraform.yml
name: Terraform
on:
  push:
    branches: [main]
    paths: ['infrastructure/**']
  pull_request:
    paths: ['infrastructure/**']

permissions:
  contents: read
  id-token: write
  pull-requests: write

jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.8.0"

      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/terraform-plan
          aws-region: us-east-1

      - name: Terraform Init
        run: terraform init
        working-directory: infrastructure/networking

      - name: Terraform Plan
        id: plan
        run: terraform plan -no-color -out=tfplan
        working-directory: infrastructure/networking

      - name: Comment PR with Plan
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const output = `#### Terraform Plan
            \`\`\`
            ${{ steps.plan.outputs.stdout }}
            \`\`\`
            *Pushed by: @${{ github.actor }}*`;
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: output
            })

  apply:
    needs: plan
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3

      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/terraform-apply
          aws-region: us-east-1

      - run: terraform init && terraform apply -auto-approve
        working-directory: infrastructure/networking

Two IAM roles: terraform-plan has read-only access, terraform-apply has write access. The plan role is used for PRs. The apply role is locked behind a GitHub environment with required reviewers.

State File Security

Your state file contains sensitive data — database passwords, API keys, resource ARNs. Treat it accordingly.

IAM Policy for State Access

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowStateBucketAccess",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::acme-terraform-state/*",
      "Condition": {
        "StringEquals": {
          "aws:PrincipalTag/Team": "${s3:prefix}"
        }
      }
    },
    {
      "Sid": "AllowLockTable",
      "Effect": "Allow",
      "Action": [
        "dynamodb:PutItem",
        "dynamodb:GetItem",
        "dynamodb:DeleteItem"
      ],
      "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/acme-terraform-locks"
    }
  ]
}

Tag-based access control: the networking team can only access state files under the networking/ prefix. The payments team can only access payments/. No one accidentally destroys another team's infrastructure.

Detecting State Drift

State drift happens when someone modifies infrastructure outside of Terraform. Detect it early.

#!/bin/bash
# drift-detection.sh — run on a schedule

MODULES=("networking/vpc" "compute/eks" "data/rds-primary")

for module in "${MODULES[@]}"; do
  echo "Checking drift for: $module"
  cd "infrastructure/$module" || { echo "ERROR: missing $module"; continue; }
  terraform init -input=false > /dev/null 2>&1
  terraform plan -detailed-exitcode -input=false -no-color > /dev/null 2>&1
  EXIT_CODE=$?

  if [ $EXIT_CODE -eq 2 ]; then
    echo "DRIFT DETECTED in $module"
    # Send alert
    curl -X POST "$SLACK_WEBHOOK" \
      -H 'Content-Type: application/json' \
      -d "{\"text\":\"Terraform drift detected in \`$module\`. Run \`terraform plan\` to review.\"}"
  elif [ $EXIT_CODE -eq 0 ]; then
    echo "No drift in $module"
  else
    echo "ERROR checking $module"
  fi
  cd - > /dev/null
done

Schedule this daily. terraform plan -detailed-exitcode returns exit code 2 when there are changes, making it scriptable. Catching drift early prevents the "someone changed this in the console and now my plan shows 47 changes" nightmare.
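A crontab entry makes the schedule concrete — a sketch; the script path and log location are assumptions:

```shell
# crontab -e — run the drift check every morning at 06:00 UTC
0 6 * * * /opt/terraform/drift-detection.sh >> /var/log/terraform-drift.log 2>&1
```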

Common Pitfalls

Pitfall 1: Storing sensitive outputs in state. Terraform stores all outputs in state as plaintext. If you output a database password, it's readable by anyone with state access. Use sensitive = true on outputs to prevent them from showing in logs, but know they're still in the state file.

output "db_password" {
  value     = random_password.db.result
  sensitive = true
}
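Keep in mind that sensitive = true changes display, not storage. The output listing is redacted, but anyone with state access can still read the value on request:

```shell
terraform output                    # listing redacts: db_password = <sensitive>
terraform output -raw db_password   # explicit request prints the real secret —
                                    # state access is secret access
```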

Pitfall 2: Running terraform state rm instead of moved blocks. Manual state operations are one-shot and unauditable. moved blocks are code-reviewed, reversible, and self-documenting. Always prefer moved blocks.

Pitfall 3: Migrating state without verifying. After terraform init -migrate-state, always run terraform plan to confirm zero changes. If the plan shows changes, the migration went wrong.

Pitfall 4: Sharing state across modules. One module's state file should never be writable by another module's pipeline. Use terraform_remote_state data sources for read-only cross-module references.

Conclusion

Remote state is not optional — it's the foundation of collaborative IaC. Set up S3 with KMS encryption, enable versioning, add DynamoDB locking, and organize your state keys by domain. Use workspaces only when they genuinely simplify your setup, and keep moved blocks and state mv in your back pocket for when refactoring day comes. Run drift detection on a schedule, lock state access with IAM policies, and run terraform apply only from CI. Your state file is your infrastructure's memory. Protect it like production data, because that's exactly what it is.

Zara Blackwood

Platform Engineer

Terraform enthusiast, platform builder, DRY advocate. I believe infrastructure should be versioned, reviewed, and deployed like any other code. GitOps or bust.
