Terraform from Zero to Production: Project Structure, Modules, State, and CI/CD

Zara Blackwood · 15 min read

Infrastructure That Isn't in Code Doesn't Exist

I've said this before and I'll keep saying it: if your infrastructure isn't versioned, reviewed, and deployed through a pipeline, it's a liability. ClickOps is technical debt with compound interest.

This guide takes you from an empty directory to a production-grade Terraform setup that a team of engineers can work in without stepping on each other. We'll cover project structure, module design, state management, environment promotion, testing, and CI/CD — everything I've learned running Terraform across platform teams managing hundreds of resources.

If you've never written Terraform, start at Part 1. If you're already running Terraform in production and it's messy, skip to Part 3.

Part 1: The Foundation

Installing and Configuring Terraform

# Install via tfenv for version management (always use tfenv)
git clone https://github.com/tfutils/tfenv.git ~/.tfenv
echo 'export PATH="$HOME/.tfenv/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc

# Install a specific version
tfenv install 1.10.3
tfenv use 1.10.3

# Pin the version in your repo
echo "1.10.3" > .terraform-version

Your First Terraform Configuration

# versions.tf — Always pin your providers
terraform {
  required_version = ">= 1.10.0, < 2.0.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.80"
    }
  }
}

# provider.tf
provider "aws" {
  region = var.aws_region

  default_tags {
    tags = {
      ManagedBy   = "terraform"
      Environment = var.environment
      Repository  = "github.com/myorg/infrastructure"
    }
  }
}

The default_tags block is non-negotiable. Every resource gets tagged with who manages it and where the code lives. When someone finds a resource in the console and wonders "who created this?", the tags answer that question.

Variables and Locals Done Right

# variables.tf
variable "environment" {
  description = "Deployment environment (dev, staging, production)"
  type        = string
  validation {
    condition     = contains(["dev", "staging", "production"], var.environment)
    error_message = "Environment must be dev, staging, or production."
  }
}

variable "aws_region" {
  description = "AWS region for resources"
  type        = string
  default     = "us-east-1"
}

variable "vpc_cidr" {
  description = "CIDR block for the VPC"
  type        = string
  validation {
    condition     = can(cidrhost(var.vpc_cidr, 0))
    error_message = "Must be a valid CIDR block."
  }
}

# locals.tf
locals {
  name_prefix = "${var.environment}-myapp"

  common_tags = {
    Environment = var.environment
    Project     = "myapp"
  }

  # Compute subnet CIDRs from VPC CIDR
  azs             = slice(data.aws_availability_zones.available.names, 0, 3)
  public_subnets  = [for i, az in local.azs : cidrsubnet(var.vpc_cidr, 8, i)]
  private_subnets = [for i, az in local.azs : cidrsubnet(var.vpc_cidr, 8, i + 10)]
}

data "aws_availability_zones" "available" {
  state = "available"
}
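To make the cidrsubnet arithmetic concrete, here is what those expressions evaluate to for a hypothetical vpc_cidr of 10.0.0.0/16 with three AZs:

```hcl
# Assuming vpc_cidr = "10.0.0.0/16" and three AZs, the locals above evaluate to:
#
#   public_subnets  = ["10.0.0.0/24",  "10.0.1.0/24",  "10.0.2.0/24"]   # netnums 0–2
#   private_subnets = ["10.0.10.0/24", "10.0.11.0/24", "10.0.12.0/24"]  # netnums 10–12
#
# cidrsubnet(prefix, newbits, netnum) extends the prefix length by newbits
# (/16 + 8 = /24) and selects the netnum-th subnet of that size. The +10 offset
# keeps private ranges clear of public ones, with room to add AZs later.
```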

Use validation blocks on every variable that has constraints. Catch misconfigurations at terraform plan, not during a failed deployment.

Part 2: Project Structure for Teams

The Repository Layout

infrastructure/
├── .terraform-version          # Pin Terraform version
├── .tflint.hcl                 # Linting configuration
├── modules/                    # Reusable modules
│   ├── vpc/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── versions.tf
│   ├── eks-cluster/
│   ├── rds/
│   └── s3-bucket/
├── environments/               # Environment-specific configurations
│   ├── dev/
│   │   ├── main.tf
│   │   ├── backend.tf
│   │   ├── terraform.tfvars
│   │   └── outputs.tf
│   ├── staging/
│   │   ├── main.tf
│   │   ├── backend.tf
│   │   ├── terraform.tfvars
│   │   └── outputs.tf
│   └── production/
│       ├── main.tf
│       ├── backend.tf
│       ├── terraform.tfvars
│       └── outputs.tf
└── global/                     # Shared resources (IAM, DNS)
    ├── iam/
    ├── route53/
    └── ecr/
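Each module pins its own provider requirements in its versions.tf so it fails loudly when used with an incompatible provider. A minimal sketch (constraint values are illustrative):

```hcl
# modules/vpc/versions.tf
terraform {
  required_version = ">= 1.10.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.0, < 6.0"  # modules use ranges; root configurations pin tighter
    }
  }
}
```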

Environment Configuration

Each environment calls the same modules with different parameters:

# environments/production/main.tf
module "vpc" {
  source = "../../modules/vpc"

  environment     = var.environment
  vpc_cidr        = var.vpc_cidr
  azs             = local.azs
  public_subnets  = local.public_subnets
  private_subnets = local.private_subnets

  enable_nat_gateway   = true
  single_nat_gateway   = false  # HA NAT in production
  enable_vpn_gateway   = false
  enable_flow_logs     = true
  flow_logs_retention  = 90
}

module "eks" {
  source = "../../modules/eks-cluster"

  cluster_name    = "${var.environment}-main"
  cluster_version = "1.31"
  vpc_id          = module.vpc.vpc_id
  subnet_ids      = module.vpc.private_subnet_ids

  node_groups = {
    general = {
      instance_types = ["m7g.xlarge"]
      min_size       = 3
      max_size       = 10
      desired_size   = 5
    }
    spot = {
      instance_types = ["m5.large", "m5a.large", "m6i.large"]  # one CPU architecture per node group (no mixing x86 and Graviton)
      capacity_type  = "SPOT"
      min_size       = 0
      max_size       = 20
      desired_size   = 3
    }
  }

  enable_cluster_autoscaler = true
  enable_metrics_server     = true
}

module "rds" {
  source = "../../modules/rds"

  identifier          = "${var.environment}-app-db"
  engine_version      = "16.4"
  instance_class      = "db.r7g.xlarge"
  allocated_storage   = 100
  multi_az            = true  # Always in production
  vpc_id              = module.vpc.vpc_id
  subnet_ids          = module.vpc.private_subnet_ids
  allowed_cidr_blocks = module.vpc.private_subnet_cidrs

  backup_retention_period = 30
  deletion_protection     = true
}
# environments/dev/main.tf — Same modules, cheaper settings
module "vpc" {
  source = "../../modules/vpc"

  environment     = var.environment
  vpc_cidr        = var.vpc_cidr
  azs             = local.azs
  public_subnets  = local.public_subnets
  private_subnets = local.private_subnets

  enable_nat_gateway   = true
  single_nat_gateway   = true   # Save money in dev
  enable_vpn_gateway   = false
  enable_flow_logs     = false  # Not needed in dev
}

This is the power of modules. Same infrastructure, different scale. Dev costs a fraction of production, but the architecture is identical.
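The per-environment knobs live in terraform.tfvars next to each main.tf — sketched here with illustrative values:

```hcl
# environments/production/terraform.tfvars
environment = "production"
aws_region  = "us-east-1"
vpc_cidr    = "10.0.0.0/16"

# environments/dev/terraform.tfvars
environment = "dev"
aws_region  = "us-east-1"
vpc_cidr    = "10.10.0.0/16"  # non-overlapping CIDRs keep VPC peering possible later
```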

Part 3: State Management

Remote State with S3 and DynamoDB

Never, ever use local state in a team environment.

# environments/production/backend.tf
terraform {
  backend "s3" {
    bucket         = "myorg-terraform-state"
    key            = "production/infrastructure.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
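Worth noting: Terraform 1.10 introduced (initially experimental) S3-native state locking via a lockfile object, which removes the DynamoDB dependency entirely. On a recent enough version, the backend can look like this instead:

```hcl
terraform {
  backend "s3" {
    bucket       = "myorg-terraform-state"
    key          = "production/infrastructure.tfstate"
    region       = "us-east-1"
    use_lockfile = true  # S3-native locking; no DynamoDB table required
    encrypt      = true
  }
}
```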

Bootstrap the state backend once, manually, from local state — then add the backend block and run terraform init -migrate-state to move the bootstrap state into the bucket it just created:

# bootstrap/main.tf — Run this first, locally
resource "aws_s3_bucket" "terraform_state" {
  bucket = "myorg-terraform-state"

  lifecycle {
    prevent_destroy = true
  }
}

resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket                  = aws_s3_bucket.terraform_state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }

  point_in_time_recovery {
    enabled = true
  }
}

State Isolation Strategy

One state file per environment, per component. Never put everything in one state file.

State files:
├── global/iam.tfstate           # IAM roles, policies
├── global/route53.tfstate       # DNS zones
├── dev/infrastructure.tfstate   # Dev VPC, EKS, RDS
├── staging/infrastructure.tfstate
├── production/infrastructure.tfstate

Use terraform_remote_state data source to reference across state boundaries:

# Reference VPC outputs from the network state
data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "myorg-terraform-state"
    key    = "${var.environment}/network.tfstate"
    region = "us-east-1"
  }
}

# Use the outputs
resource "aws_security_group" "app" {
  vpc_id = data.terraform_remote_state.network.outputs.vpc_id
  # ...
}
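For that lookup to work, the network configuration has to export the value — terraform_remote_state only exposes root-level outputs:

```hcl
# In the network configuration's outputs.tf (resource name illustrative)
output "vpc_id" {
  description = "ID of the VPC, consumed by downstream states"
  value       = aws_vpc.main.id
}
```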

Part 4: Module Design Patterns

The Opinionated Module

Good modules make common things easy and uncommon things possible:

# modules/s3-bucket/main.tf
resource "aws_s3_bucket" "this" {
  bucket = var.bucket_name

  tags = merge(var.tags, {
    Module = "s3-bucket"
  })
}

resource "aws_s3_bucket_versioning" "this" {
  bucket = aws_s3_bucket.this.id
  versioning_configuration {
    status = var.enable_versioning ? "Enabled" : "Suspended"  # "Disabled" is rejected once versioning has ever been enabled
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "this" {
  bucket = aws_s3_bucket.this.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = var.kms_key_arn != null ? "aws:kms" : "AES256"
      kms_master_key_id = var.kms_key_arn
    }
    bucket_key_enabled = var.kms_key_arn != null
  }
}

# Always block public access — override requires explicit opt-in
resource "aws_s3_bucket_public_access_block" "this" {
  bucket                  = aws_s3_bucket.this.id
  block_public_acls       = var.allow_public_access ? false : true
  block_public_policy     = var.allow_public_access ? false : true
  ignore_public_acls      = var.allow_public_access ? false : true
  restrict_public_buckets = var.allow_public_access ? false : true
}

resource "aws_s3_bucket_lifecycle_configuration" "this" {
  count  = length(var.lifecycle_rules) > 0 ? 1 : 0
  bucket = aws_s3_bucket.this.id

  dynamic "rule" {
    for_each = var.lifecycle_rules
    content {
      id     = rule.value.id
      status = "Enabled"

      transition {
        days          = rule.value.transition_days
        storage_class = rule.value.storage_class
      }

      dynamic "expiration" {
        for_each = rule.value.expiration_days != null ? [1] : []
        content {
          days = rule.value.expiration_days
        }
      }
    }
  }
}
# modules/s3-bucket/variables.tf
variable "bucket_name" {
  description = "Name of the S3 bucket"
  type        = string
}

variable "enable_versioning" {
  description = "Enable bucket versioning"
  type        = bool
  default     = true  # Safe default
}

variable "kms_key_arn" {
  description = "KMS key ARN for encryption (null = AES256)"
  type        = string
  default     = null
}

variable "allow_public_access" {
  description = "Allow public access (must explicitly opt in)"
  type        = bool
  default     = false  # Secure default
}

variable "lifecycle_rules" {
  description = "List of lifecycle rules"
  type = list(object({
    id              = string
    transition_days = number
    storage_class   = string
    expiration_days = optional(number)
  }))
  default = []
}

variable "tags" {
  description = "Tags to apply to all resources"
  type        = map(string)
  default     = {}
}

Notice the defaults: versioning on, encryption on, public access blocked. The secure path is the easy path. If someone wants to make a bucket public, they have to explicitly set allow_public_access = true and explain why in the PR.
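A caller then gets all of those defaults for free. A hypothetical invocation (bucket name and rule values are illustrative):

```hcl
module "app_logs" {
  source = "../../modules/s3-bucket"

  bucket_name = "myorg-production-app-logs"
  tags        = local.common_tags

  lifecycle_rules = [{
    id              = "archive-then-expire"
    transition_days = 30
    storage_class   = "GLACIER"
    expiration_days = 365
  }]

  # Versioning, encryption, and public-access blocking all come from the defaults
}
```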

Part 5: Testing Your Infrastructure

Terraform Validate and TFLint

# .tflint.hcl
config {
  call_module_type = "local"
}

plugin "terraform" {
  enabled = true
  preset  = "recommended"
}

plugin "aws" {
  enabled = true
  version = "0.35.0"
  source  = "github.com/terraform-linters/tflint-ruleset-aws"
}

rule "terraform_naming_convention" {
  enabled = true
}

rule "terraform_documented_variables" {
  enabled = true
}

Terratest for Integration Testing

// test/vpc_test.go
package test

import (
    "testing"
    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
)

func TestVpcModule(t *testing.T) {
    t.Parallel()

    terraformOptions := terraform.WithDefaultRetryableErrors(t, &terraform.Options{
        TerraformDir: "../modules/vpc",
        Vars: map[string]interface{}{
            "environment": "test",
            "vpc_cidr":    "10.99.0.0/16",
        },
    })

    defer terraform.Destroy(t, terraformOptions)
    terraform.InitAndApply(t, terraformOptions)

    vpcId := terraform.Output(t, terraformOptions, "vpc_id")
    assert.NotEmpty(t, vpcId)

    privateSubnets := terraform.OutputList(t, terraformOptions, "private_subnet_ids")
    assert.Equal(t, 3, len(privateSubnets))
}

Policy as Code with OPA/Conftest

# policy/terraform.rego
package terraform

deny[msg] {
    resource := input.resource_changes[_]
    resource.type == "aws_s3_bucket"
    not has_encryption(resource)
    msg := sprintf("S3 bucket '%s' must have encryption enabled", [resource.address])
}

deny[msg] {
    resource := input.resource_changes[_]
    resource.type == "aws_security_group_rule"
    resource.change.after.cidr_blocks[_] == "0.0.0.0/0"
    resource.change.after.type == "ingress"
    msg := sprintf("Security group rule '%s' allows ingress from 0.0.0.0/0", [resource.address])
}

# With AWS provider v4+, bucket encryption is a separate resource, so look for an
# aws_s3_bucket_server_side_encryption_configuration in the same module instance
# (a coarse but practical match; root resources have no module_address)
has_encryption(bucket) {
    enc := input.resource_changes[_]
    enc.type == "aws_s3_bucket_server_side_encryption_configuration"
    object.get(enc, "module_address", "") == object.get(bucket, "module_address", "")
}
# Run policy checks against the plan
terraform plan -out=plan.tfplan
terraform show -json plan.tfplan > plan.json
conftest test plan.json --policy policy/

Part 6: CI/CD Pipeline

GitHub Actions for Terraform

name: Terraform
on:
  pull_request:
    paths:
      - 'environments/**'
      - 'modules/**'
  push:
    branches: [main]
    paths:
      - 'environments/**'
      - 'modules/**'

permissions:
  contents: read
  pull-requests: write
  id-token: write

jobs:
  detect-changes:
    runs-on: ubuntu-latest
    outputs:
      environments: ${{ steps.changes.outputs.environments }}
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # full history so git diff can reach the base commit
      - id: changes
        run: |
          # Detect which environments changed (PRs diff against the base branch;
          # github.event.before is only set on push events)
          BASE="${{ github.event_name == 'pull_request' && github.event.pull_request.base.sha || github.event.before }}"
          ENVS=$(git diff --name-only "$BASE" ${{ github.sha }} | \
            grep -oP 'environments/\K[^/]+' | sort -u | jq -R -s -c 'split("\n")[:-1]')
          echo "environments=$ENVS" >> "$GITHUB_OUTPUT"

  plan:
    needs: detect-changes
    runs-on: ubuntu-latest
    strategy:
      matrix:
        environment: ${{ fromJson(needs.detect-changes.outputs.environments) }}
    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.10.3

      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/terraform-ci
          aws-region: us-east-1

      - name: Terraform Init
        working-directory: environments/${{ matrix.environment }}
        run: terraform init -no-color

      - name: Terraform Validate
        working-directory: environments/${{ matrix.environment }}
        run: terraform validate -no-color

      - name: Setup TFLint
        uses: terraform-linters/setup-tflint@v4

      - name: TFLint
        run: |
          tflint --init
          tflint --chdir environments/${{ matrix.environment }}

      - name: Terraform Plan
        id: plan
        working-directory: environments/${{ matrix.environment }}
        run: |
          terraform plan -no-color -out=plan.tfplan 2>&1 | tee plan.txt
          terraform show -json plan.tfplan > plan.json

      - name: Policy Check
        run: |
          # conftest isn't preinstalled on GitHub runners — install it from its releases first
          conftest test environments/${{ matrix.environment }}/plan.json --policy policy/

      - name: Comment PR
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const plan = fs.readFileSync('environments/${{ matrix.environment }}/plan.txt', 'utf8');
            const truncated = plan.length > 60000 ? plan.substring(0, 60000) + '\n\n... truncated' : plan;

            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `### Terraform Plan — \`${{ matrix.environment }}\`\n\`\`\`\n${truncated}\n\`\`\``
            });

  apply:
    needs: [detect-changes, plan]  # detect-changes must be listed for its outputs to reach this job's matrix
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    runs-on: ubuntu-latest
    strategy:
      max-parallel: 1  # Apply one environment at a time
      matrix:
        environment: ${{ fromJson(needs.detect-changes.outputs.environments) }}
    environment: ${{ matrix.environment }}
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.10.3
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/terraform-ci
          aws-region: us-east-1
      - name: Terraform Apply
        working-directory: environments/${{ matrix.environment }}
        run: |
          terraform init -no-color
          terraform apply -auto-approve -no-color

The max-parallel: 1 on the apply job is critical. You don't want to apply staging and production simultaneously.
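The role-to-assume above relies on GitHub's OIDC provider being registered in the account. A sketch of that wiring (account, repo name, and thumbprint handling are illustrative — this belongs in the global/iam state):

```hcl
resource "aws_iam_openid_connect_provider" "github" {
  url             = "https://token.actions.githubusercontent.com"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = ["6938fd4d98bab03faadb97b34396831e3780aea1"]  # verify the current thumbprint
}

resource "aws_iam_role" "terraform_ci" {
  name = "terraform-ci"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Federated = aws_iam_openid_connect_provider.github.arn }
      Action    = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringEquals = {
          "token.actions.githubusercontent.com:aud" = "sts.amazonaws.com"
        }
        StringLike = {
          # Only workflows in this repository may assume the role
          "token.actions.githubusercontent.com:sub" = "repo:myorg/infrastructure:*"
        }
      }
    }]
  })
}
```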

Part 7: State Surgery and Disaster Recovery

Eventually, you'll need to manipulate state directly. These operations are dangerous but sometimes necessary.

Common State Operations

# Import an existing resource into Terraform state
terraform import aws_s3_bucket.logs my-existing-log-bucket

# Remove a resource from state without destroying it
terraform state rm aws_s3_bucket.legacy_data

# Move a resource to a different address (after refactoring)
terraform state mv aws_s3_bucket.this module.storage.aws_s3_bucket.this

# List all resources in state
terraform state list

# Show details of a specific resource
terraform state show aws_s3_bucket.this

State Backup and Recovery

Always back up state before surgery:

# Pull current state
terraform state pull > state-backup-$(date +%Y%m%d-%H%M%S).tfstate

# If something goes wrong, push the backup
terraform state push state-backup-20260323-143000.tfstate

For S3 backends with versioning enabled, you can also recover previous state versions through the S3 console or CLI:

# List state file versions
aws s3api list-object-versions \
  --bucket myorg-terraform-state \
  --prefix production/infrastructure.tfstate \
  --query 'Versions[0:5].{VersionId:VersionId,Modified:LastModified,Size:Size}'

# Download a previous version
aws s3api get-object \
  --bucket myorg-terraform-state \
  --key production/infrastructure.tfstate \
  --version-id "abc123" \
  recovered-state.tfstate

Handling State Lock Issues

When a terraform apply is interrupted (CI runner dies, network drops), the DynamoDB lock remains:

# Check for stuck locks
aws dynamodb scan --table-name terraform-locks \
  --query 'Items[*].{LockID: LockID.S, Info: Info.S}'

# Force unlock (only when you're certain no one else is running)
terraform force-unlock <LOCK-ID>

Part 8: Terraform Import and Brownfield Adoption

Most organizations aren't starting from scratch. You have existing infrastructure that needs to be brought under Terraform management.

Bulk Import Strategy

# Use import blocks (Terraform 1.5+) for declarative imports
import {
  to = aws_vpc.main
  id = "vpc-0123456789abcdef0"
}

import {
  to = aws_subnet.private["us-east-1a"]
  id = "subnet-0123456789abcdef0"
}

import {
  to = aws_subnet.private["us-east-1b"]
  id = "subnet-0987654321fedcba0"
}

# Then run terraform plan to generate the configuration
terraform plan -generate-config-out=generated.tf

The -generate-config-out flag is a game-changer for brownfield adoption. It reverse-engineers the resource configuration from AWS and writes it as Terraform code. You'll need to clean it up — remove computed attributes, parameterize values, extract into modules — but it's a massive head start over writing everything from scratch.
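To set expectations, generated code is verbose — a hypothetical before/after for a VPC:

```hcl
# generated.tf as emitted — every attribute spelled out, values hardcoded
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
  instance_tenancy     = "default"
  tags                 = { Name = "main" }
}

# After cleanup — provider defaults dropped, values parameterized
resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_hostnames = true
  tags                 = { Name = "${local.name_prefix}-vpc" }
}
```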

Migration Workflow

  1. Inventory existing resources using AWS Config or aws resourcegroupstaggingapi get-resources.
  2. Write import blocks for each resource.
  3. Generate configuration with terraform plan -generate-config-out.
  4. Clean up generated code — extract variables, remove defaults, organize into files.
  5. Run terraform plan — it should show zero changes if the import and config are correct (terraform plan -detailed-exitcode exits 0 when nothing would change, which makes this easy to gate in CI).
  6. Add to CI/CD and treat it as managed infrastructure going forward.

Troubleshooting Common Terraform Issues

"Error acquiring the state lock"

# Someone else is running terraform, or a previous run crashed
# First, check who holds the lock (the DynamoDB scan in Part 7 shows the holder)
terraform force-unlock <LOCK-ID>  # Only if you're sure the lock is stale

"Provider produced inconsistent result"

This happens when a resource attribute changes outside of Terraform (someone clicked in the console):

# Refresh state to match reality
terraform apply -refresh-only

# Review the changes, then approve

"Cycle detected in resource dependencies"

Break the cycle by using depends_on explicitly or restructuring your resources:

# Instead of circular references between security groups:
resource "aws_security_group" "app" {
  name   = "app-sg"
  vpc_id = var.vpc_id
}

resource "aws_security_group" "db" {
  name   = "db-sg"
  vpc_id = var.vpc_id
}

# Add rules as separate resources to break the cycle
resource "aws_security_group_rule" "app_to_db" {
  type                     = "egress"
  security_group_id        = aws_security_group.app.id
  source_security_group_id = aws_security_group.db.id
  from_port                = 5432
  to_port                  = 5432
  protocol                 = "tcp"
}

resource "aws_security_group_rule" "db_from_app" {
  type                     = "ingress"
  security_group_id        = aws_security_group.db.id
  source_security_group_id = aws_security_group.app.id
  from_port                = 5432
  to_port                  = 5432
  protocol                 = "tcp"
}

"Error: Unsupported attribute" After Provider Upgrade

Pin your providers and upgrade deliberately:

# Record provider checksums for every platform your team uses
terraform providers lock -platform=linux_amd64 -platform=darwin_arm64

# Upgrade deliberately, then review the diff in .terraform.lock.hcl
terraform init -upgrade

# Run plan immediately to catch breaking changes
terraform plan

The Golden Rules

  1. Pin everything. Terraform version, provider versions, module versions. Unpinned versions are time bombs.
  2. State is sacred. Use remote state, enable locking, enable versioning. Corrupted state is the worst Terraform failure mode.
  3. Modules enforce standards. Security defaults baked into modules mean every team gets the right configuration by default.
  4. Plan is mandatory. Never apply without reviewing the plan. Automate the plan, require human review before apply.
  5. Environments should differ in scale, not structure. If your dev and production infrastructure are architecturally different, you're going to have a bad time.
  6. Blast radius matters. Small state files, small changes, small blast radius. A change that modifies 50 resources in one apply is a change that can break 50 things at once.
  7. Import before you recreate. If the resource exists in AWS, import it into state. Don't destroy and recreate — that causes downtime and data loss.
  8. Use moved blocks for refactoring. When reorganizing code into modules, use moved blocks instead of state manipulation. They're declarative, reviewable, and reversible.
# When moving a resource into a module
moved {
  from = aws_s3_bucket.logs
  to   = module.logging.aws_s3_bucket.this
}

Infrastructure as code isn't just about automation — it's about building systems that a team can understand, review, and trust. When anyone on the team can read a PR and understand exactly what infrastructure will change, you've achieved the goal. That's what production-grade Terraform looks like.

The journey from a single main.tf file to a fully modularized, tested, CI/CD-driven Terraform setup takes time. Don't try to build the perfect setup on day one. Start with remote state and locking. Then extract your first module. Then add CI with automated plan comments. Each step makes your infrastructure more reliable, more reviewable, and more scalable. A year from now, you'll look back at the investment and wonder how you ever managed infrastructure any other way.

Zara Blackwood

Platform Engineer

Terraform enthusiast, platform builder, DRY advocate. I believe infrastructure should be versioned, reviewed, and deployed like any other code. GitOps or bust.
