Terraform from Zero to Production: Project Structure, Modules, State, and CI/CD
Infrastructure That Isn't in Code Doesn't Exist
I've said this before and I'll keep saying it: if your infrastructure isn't versioned, reviewed, and deployed through a pipeline, it's a liability. ClickOps is technical debt with compound interest.
This guide takes you from an empty directory to a production-grade Terraform setup that a team of engineers can work in without stepping on each other. We're covering project structure, module design, state management, environment promotion, testing, and CI/CD. Everything I've learned running Terraform across platform teams managing hundreds of resources.
If you've never written Terraform, start at Part 1. If you're already running Terraform in production and it's messy, skip to Part 3.
Part 1: The Foundation
Installing and Configuring Terraform
# Install via tfenv for version management (always use tfenv)
git clone https://github.com/tfutils/tfenv.git ~/.tfenv
echo 'export PATH="$HOME/.tfenv/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
# Install a specific version
tfenv install 1.10.3
tfenv use 1.10.3
# Pin the version in your repo
echo "1.10.3" > .terraform-version
Your First Terraform Configuration
# versions.tf — Always pin your providers
terraform {
  required_version = ">= 1.10.0, < 2.0.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.80"
    }
  }
}

# provider.tf
provider "aws" {
  region = var.aws_region

  default_tags {
    tags = {
      ManagedBy   = "terraform"
      Environment = var.environment
      Repository  = "github.com/myorg/infrastructure"
    }
  }
}
The default_tags block is non-negotiable. Every resource gets tagged with who manages it and where the code lives. When someone finds a resource in the console and wonders "who created this?", the tags answer that question.
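Resource-level tags merge on top of default_tags, so individual resources only declare what's specific to them. A minimal sketch, with a hypothetical bucket name and tag value:

```hcl
resource "aws_s3_bucket" "audit_logs" {
  bucket = "myorg-audit-logs" # hypothetical name

  # ManagedBy, Environment, and Repository are inherited from default_tags;
  # only the resource-specific tag is declared here.
  tags = {
    DataClassification = "internal"
  }
}
```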
Variables and Locals Done Right
# variables.tf
variable "environment" {
  description = "Deployment environment (dev, staging, production)"
  type        = string

  validation {
    condition     = contains(["dev", "staging", "production"], var.environment)
    error_message = "Environment must be dev, staging, or production."
  }
}

variable "aws_region" {
  description = "AWS region for resources"
  type        = string
  default     = "us-east-1"
}

variable "vpc_cidr" {
  description = "CIDR block for the VPC"
  type        = string

  validation {
    condition     = can(cidrhost(var.vpc_cidr, 0))
    error_message = "Must be a valid CIDR block."
  }
}

# locals.tf
locals {
  name_prefix = "${var.environment}-myapp"

  common_tags = {
    Environment = var.environment
    Project     = "myapp"
  }

  # Compute subnet CIDRs from VPC CIDR
  azs             = slice(data.aws_availability_zones.available.names, 0, 3)
  public_subnets  = [for i, az in local.azs : cidrsubnet(var.vpc_cidr, 8, i)]
  private_subnets = [for i, az in local.azs : cidrsubnet(var.vpc_cidr, 8, i + 10)]
}

data "aws_availability_zones" "available" {
  state = "available"
}
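To make the cidrsubnet arithmetic concrete: adding 8 bits to a /16 VPC yields /24 subnets, and the netnum argument selects the third octet. A worked sketch for a hypothetical 10.0.0.0/16:

```hcl
# Worked example (hypothetical vpc_cidr = "10.0.0.0/16"):
#   cidrsubnet("10.0.0.0/16", 8, 0)  = "10.0.0.0/24"   # public, first AZ
#   cidrsubnet("10.0.0.0/16", 8, 1)  = "10.0.1.0/24"   # public, second AZ
#   cidrsubnet("10.0.0.0/16", 8, 2)  = "10.0.2.0/24"   # public, third AZ
#   cidrsubnet("10.0.0.0/16", 8, 10) = "10.0.10.0/24"  # private, first AZ
#   cidrsubnet("10.0.0.0/16", 8, 11) = "10.0.11.0/24"  # private, second AZ
#   cidrsubnet("10.0.0.0/16", 8, 12) = "10.0.12.0/24"  # private, third AZ
locals {
  example_public  = [for i in range(3) : cidrsubnet("10.0.0.0/16", 8, i)]
  example_private = [for i in range(3) : cidrsubnet("10.0.0.0/16", 8, i + 10)]
}
```

The +10 offset leaves netnums 3 through 9 free for future subnet tiers without renumbering.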
Use validation blocks on every variable that has constraints. Catch misconfigurations at terraform plan, not during a failed deployment.
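The aws_region variable above ships without one; a sketch of what it could look like, keeping in mind that the regex is only an approximation of AWS region naming, not an exhaustive check:

```hcl
variable "aws_region" {
  description = "AWS region for resources"
  type        = string
  default     = "us-east-1"

  validation {
    # Approximate pattern: "us-east-1", "eu-west-2", "ap-southeast-3", ...
    # (partitions like us-gov-* would need a looser pattern)
    condition     = can(regex("^[a-z]{2}-[a-z]+-\\d$", var.aws_region))
    error_message = "aws_region must look like an AWS region, e.g. us-east-1."
  }
}
```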
Part 2: Project Structure for Teams
The Repository Layout
infrastructure/
├── .terraform-version # Pin Terraform version
├── .tflint.hcl # Linting configuration
├── modules/ # Reusable modules
│ ├── vpc/
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ ├── outputs.tf
│ │ └── versions.tf
│ ├── eks-cluster/
│ ├── rds/
│ └── s3-bucket/
├── environments/ # Environment-specific configurations
│ ├── dev/
│ │ ├── main.tf
│ │ ├── backend.tf
│ │ ├── terraform.tfvars
│ │ └── outputs.tf
│ ├── staging/
│ │ ├── main.tf
│ │ ├── backend.tf
│ │ ├── terraform.tfvars
│ │ └── outputs.tf
│ └── production/
│ ├── main.tf
│ ├── backend.tf
│ ├── terraform.tfvars
│ └── outputs.tf
└── global/ # Shared resources (IAM, DNS)
├── iam/
├── route53/
└── ecr/
Environment Configuration
Each environment calls the same modules with different parameters:
# environments/production/main.tf
module "vpc" {
  source = "../../modules/vpc"

  environment     = var.environment
  vpc_cidr        = var.vpc_cidr
  azs             = local.azs
  public_subnets  = local.public_subnets
  private_subnets = local.private_subnets

  enable_nat_gateway  = true
  single_nat_gateway  = false # HA NAT in production
  enable_vpn_gateway  = false
  enable_flow_logs    = true
  flow_logs_retention = 90
}

module "eks" {
  source = "../../modules/eks-cluster"

  cluster_name    = "${var.environment}-main"
  cluster_version = "1.31"
  vpc_id          = module.vpc.vpc_id
  subnet_ids      = module.vpc.private_subnet_ids

  node_groups = {
    general = {
      instance_types = ["m7g.xlarge"]
      min_size       = 3
      max_size       = 10
      desired_size   = 5
    }
    spot = {
      # x86 only: a managed node group can't mix x86 and ARM instance types
      instance_types = ["m5.large", "m5a.large", "m6i.large", "m6a.large"]
      capacity_type  = "SPOT"
      min_size       = 0
      max_size       = 20
      desired_size   = 3
    }
  }

  enable_cluster_autoscaler = true
  enable_metrics_server     = true
}

module "rds" {
  source = "../../modules/rds"

  identifier        = "${var.environment}-app-db"
  engine_version    = "16.4"
  instance_class    = "db.r7g.xlarge"
  allocated_storage = 100
  multi_az          = true # Always in production

  vpc_id              = module.vpc.vpc_id
  subnet_ids          = module.vpc.private_subnet_ids
  allowed_cidr_blocks = module.vpc.private_subnet_cidrs

  backup_retention_period = 30
  deletion_protection     = true
}

# environments/dev/main.tf — Same modules, cheaper settings
module "vpc" {
  source = "../../modules/vpc"

  environment     = var.environment
  vpc_cidr        = var.vpc_cidr
  azs             = local.azs
  public_subnets  = local.public_subnets
  private_subnets = local.private_subnets

  enable_nat_gateway = true
  single_nat_gateway = true # Save money in dev
  enable_vpn_gateway = false
  enable_flow_logs   = false # Not needed in dev
}
This is the power of modules. Same infrastructure, different scale. Dev costs a fraction of production, but the architecture is identical.
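The per-environment values live in each environment's terraform.tfvars. A sketch of how the two environments above might differ (the CIDRs are illustrative):

```hcl
# environments/production/terraform.tfvars
environment = "production"
vpc_cidr    = "10.0.0.0/16"

# environments/dev/terraform.tfvars
environment = "dev"
vpc_cidr    = "10.64.0.0/16" # non-overlapping with production, in case you ever peer them
```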
Part 3: State Management
Remote State with S3 and DynamoDB
Never, ever use local state in a team environment.
# environments/production/backend.tf
terraform {
  backend "s3" {
    bucket         = "myorg-terraform-state"
    key            = "production/infrastructure.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
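Terraform 1.10 added native S3 state locking as an alternative to the DynamoDB table. A hedged sketch; verify use_lockfile support in the S3 backend docs for your exact Terraform version before dropping DynamoDB:

```hcl
# backend.tf variant: S3-native locking, no DynamoDB table required (Terraform 1.10+)
terraform {
  backend "s3" {
    bucket       = "myorg-terraform-state"
    key          = "production/infrastructure.tfstate"
    region       = "us-east-1"
    encrypt      = true
    use_lockfile = true # locks via an S3 object written next to the state file
  }
}
```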
Bootstrap the state backend (do this once, manually):
# bootstrap/main.tf — Run this first, locally
resource "aws_s3_bucket" "terraform_state" {
  bucket = "myorg-terraform-state"

  lifecycle {
    prevent_destroy = true
  }
}

resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }

  point_in_time_recovery {
    enabled = true
  }
}
State Isolation Strategy
One state file per environment, per component. Never put everything in one state file.
State files:
├── global/iam.tfstate # IAM roles, policies
├── global/route53.tfstate # DNS zones
├── dev/infrastructure.tfstate # Dev VPC, EKS, RDS
├── staging/infrastructure.tfstate
├── production/infrastructure.tfstate
Use terraform_remote_state data source to reference across state boundaries:
# Reference VPC outputs from the network state
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "myorg-terraform-state"
    key    = "${var.environment}/network.tfstate"
    region = "us-east-1"
  }
}

# Use the outputs
resource "aws_security_group" "app" {
  vpc_id = data.terraform_remote_state.network.outputs.vpc_id
  # ...
}
Part 4: Module Design Patterns
The Opinionated Module
Good modules make common things easy and uncommon things possible:
# modules/s3-bucket/main.tf
resource "aws_s3_bucket" "this" {
  bucket = var.bucket_name

  tags = merge(var.tags, {
    Module = "s3-bucket"
  })
}
resource "aws_s3_bucket_versioning" "this" {
  bucket = aws_s3_bucket.this.id

  versioning_configuration {
    # "Suspended", not "Disabled": "Disabled" is only valid for buckets
    # that have never had versioning enabled, so it fails on toggle-off.
    status = var.enable_versioning ? "Enabled" : "Suspended"
  }
}
resource "aws_s3_bucket_server_side_encryption_configuration" "this" {
  bucket = aws_s3_bucket.this.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = var.kms_key_arn != null ? "aws:kms" : "AES256"
      kms_master_key_id = var.kms_key_arn
    }
    bucket_key_enabled = var.kms_key_arn != null
  }
}

# Always block public access — override requires explicit opt-in
resource "aws_s3_bucket_public_access_block" "this" {
  bucket = aws_s3_bucket.this.id

  block_public_acls       = !var.allow_public_access
  block_public_policy     = !var.allow_public_access
  ignore_public_acls      = !var.allow_public_access
  restrict_public_buckets = !var.allow_public_access
}

resource "aws_s3_bucket_lifecycle_configuration" "this" {
  count = length(var.lifecycle_rules) > 0 ? 1 : 0

  bucket = aws_s3_bucket.this.id

  dynamic "rule" {
    for_each = var.lifecycle_rules

    content {
      id     = rule.value.id
      status = "Enabled"

      transition {
        days          = rule.value.transition_days
        storage_class = rule.value.storage_class
      }

      dynamic "expiration" {
        for_each = rule.value.expiration_days != null ? [1] : []

        content {
          days = rule.value.expiration_days
        }
      }
    }
  }
}
# modules/s3-bucket/variables.tf
variable "bucket_name" {
  description = "Name of the S3 bucket"
  type        = string
}

variable "enable_versioning" {
  description = "Enable bucket versioning"
  type        = bool
  default     = true # Safe default
}

variable "kms_key_arn" {
  description = "KMS key ARN for encryption (null = AES256)"
  type        = string
  default     = null
}

variable "allow_public_access" {
  description = "Allow public access (must explicitly opt in)"
  type        = bool
  default     = false # Secure default
}

variable "lifecycle_rules" {
  description = "List of lifecycle rules"
  type = list(object({
    id              = string
    transition_days = number
    storage_class   = string
    expiration_days = optional(number)
  }))
  default = []
}

variable "tags" {
  description = "Tags to apply to all resources"
  type        = map(string)
  default     = {}
}
Notice the defaults: versioning on, encryption on, public access blocked. The secure path is the easy path. If someone wants to make a bucket public, they have to explicitly set allow_public_access = true and explain why in the PR.
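A call site can then stay terse for the common case while inheriting the hardened defaults. A sketch; the bucket name and lifecycle values are illustrative:

```hcl
module "app_assets" {
  source      = "../../modules/s3-bucket"
  bucket_name = "myorg-production-app-assets" # illustrative name

  # Versioning, AES256 encryption, and the public access block all apply by default.
  lifecycle_rules = [{
    id              = "archive-old-assets"
    transition_days = 90
    storage_class   = "GLACIER"
    expiration_days = 365
  }]

  tags = local.common_tags
}
```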
Part 5: Testing Your Infrastructure
Terraform Validate and TFLint
# .tflint.hcl
config {
  call_module_type = "local"
}

plugin "terraform" {
  enabled = true
  preset  = "recommended"
}

plugin "aws" {
  enabled = true
  version = "0.35.0"
  source  = "github.com/terraform-linters/tflint-ruleset-aws"
}

rule "terraform_naming_convention" {
  enabled = true
}

rule "terraform_documented_variables" {
  enabled = true
}
Terratest for Integration Testing
// test/vpc_test.go
package test

import (
	"testing"

	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
)

func TestVpcModule(t *testing.T) {
	t.Parallel()

	terraformOptions := terraform.WithDefaultRetryableErrors(t, &terraform.Options{
		TerraformDir: "../modules/vpc",
		Vars: map[string]interface{}{
			"environment": "test",
			"vpc_cidr":    "10.99.0.0/16",
		},
	})

	defer terraform.Destroy(t, terraformOptions)
	terraform.InitAndApply(t, terraformOptions)

	vpcId := terraform.Output(t, terraformOptions, "vpc_id")
	assert.NotEmpty(t, vpcId)

	privateSubnets := terraform.OutputList(t, terraformOptions, "private_subnet_ids")
	assert.Equal(t, 3, len(privateSubnets))
}
Policy as Code with OPA/Conftest
# policy/terraform.rego
package terraform

deny[msg] {
  resource := input.resource_changes[_]
  resource.type == "aws_s3_bucket"
  not has_encryption_config
  msg := sprintf("S3 bucket '%s' must have encryption enabled", [resource.address])
}

deny[msg] {
  resource := input.resource_changes[_]
  resource.type == "aws_security_group_rule"
  resource.change.after.type == "ingress"
  resource.change.after.cidr_blocks[_] == "0.0.0.0/0"
  msg := sprintf("Security group rule '%s' allows ingress from 0.0.0.0/0", [resource.address])
}

# AWS provider v5 manages bucket encryption as its own resource type,
# so check that the plan contains one. This is a coarse, plan-wide check;
# pairing each bucket with its own encryption resource takes more policy code.
has_encryption_config {
  input.resource_changes[_].type == "aws_s3_bucket_server_side_encryption_configuration"
}
# Run policy checks against the plan
terraform plan -out=plan.tfplan
terraform show -json plan.tfplan > plan.json
conftest test plan.json --policy policy/
Part 6: CI/CD Pipeline
GitHub Actions for Terraform
name: Terraform

on:
  pull_request:
    paths:
      - 'environments/**'
      - 'modules/**'
  push:
    branches: [main]
    paths:
      - 'environments/**'
      - 'modules/**'

permissions:
  contents: read
  pull-requests: write
  id-token: write

jobs:
  detect-changes:
    runs-on: ubuntu-latest
    outputs:
      environments: ${{ steps.changes.outputs.environments }}
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0 # full history, needed for the diff below
      - id: changes
        run: |
          # Detect which environments changed (base SHA for PRs, before SHA for pushes)
          ENVS=$(git diff --name-only ${{ github.event.pull_request.base.sha || github.event.before }} ${{ github.sha }} | \
            grep -oP 'environments/\K[^/]+' | sort -u | jq -R -s -c 'split("\n")[:-1]')
          echo "environments=$ENVS" >> "$GITHUB_OUTPUT"

  plan:
    needs: detect-changes
    runs-on: ubuntu-latest
    strategy:
      matrix:
        environment: ${{ fromJson(needs.detect-changes.outputs.environments) }}
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.10.3
          terraform_wrapper: false # the wrapper interferes with the output redirection below
      - uses: terraform-linters/setup-tflint@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/terraform-ci
          aws-region: us-east-1
      - name: Terraform Init
        working-directory: environments/${{ matrix.environment }}
        run: terraform init -no-color
      - name: Terraform Validate
        working-directory: environments/${{ matrix.environment }}
        run: terraform validate -no-color
      - name: TFLint
        run: |
          tflint --init
          tflint --chdir environments/${{ matrix.environment }}
      - name: Terraform Plan
        id: plan
        working-directory: environments/${{ matrix.environment }}
        run: |
          terraform plan -no-color -out=plan.tfplan 2>&1 | tee plan.txt
          terraform show -json plan.tfplan > plan.json
      - name: Policy Check
        # assumes conftest is preinstalled or installed in an earlier step
        run: conftest test environments/${{ matrix.environment }}/plan.json --policy policy/
      - name: Comment PR
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const plan = fs.readFileSync('environments/${{ matrix.environment }}/plan.txt', 'utf8');
            const truncated = plan.length > 60000 ? plan.substring(0, 60000) + '\n\n... truncated' : plan;
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `### Terraform Plan — \`${{ matrix.environment }}\`\n\`\`\`\n${truncated}\n\`\`\``
            });

  apply:
    needs: [detect-changes, plan] # detect-changes must be listed to reference its outputs here
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    runs-on: ubuntu-latest
    strategy:
      max-parallel: 1 # Apply one environment at a time
      matrix:
        environment: ${{ fromJson(needs.detect-changes.outputs.environments) }}
    environment: ${{ matrix.environment }} # GitHub deployment environment, so protection rules apply
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.10.3
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/terraform-ci
          aws-region: us-east-1
      - name: Terraform Apply
        working-directory: environments/${{ matrix.environment }}
        run: |
          terraform init -no-color
          terraform apply -auto-approve -no-color
The max-parallel: 1 on the apply job is critical. You don't want to apply staging and production simultaneously.
Part 7: State Surgery and Disaster Recovery
Eventually, you'll need to manipulate state directly. These operations are dangerous but sometimes necessary.
Common State Operations
# Import an existing resource into Terraform state
terraform import aws_s3_bucket.logs my-existing-log-bucket
# Remove a resource from state without destroying it
terraform state rm aws_s3_bucket.legacy_data
# Move a resource to a different address (after refactoring)
terraform state mv aws_s3_bucket.this module.storage.aws_s3_bucket.this
# List all resources in state
terraform state list
# Show details of a specific resource
terraform state show aws_s3_bucket.this
State Backup and Recovery
Always back up state before surgery:
# Pull current state
terraform state pull > state-backup-$(date +%Y%m%d-%H%M%S).tfstate
# If something goes wrong, push the backup
terraform state push state-backup-20260323-143000.tfstate
For S3 backends with versioning enabled, you can also recover previous state versions through the S3 console or CLI:
# List state file versions
aws s3api list-object-versions \
--bucket myorg-terraform-state \
--prefix production/infrastructure.tfstate \
--query 'Versions[0:5].{VersionId:VersionId,Modified:LastModified,Size:Size}'
# Download a previous version
aws s3api get-object \
--bucket myorg-terraform-state \
--key production/infrastructure.tfstate \
--version-id "abc123" \
recovered-state.tfstate
Handling State Lock Issues
When a terraform apply is interrupted (CI runner dies, network drops), the DynamoDB lock remains:
# Check for stuck locks
aws dynamodb scan --table-name terraform-locks \
--query 'Items[*].{LockID: LockID.S, Info: Info.S}'
# Force unlock (only when you're certain no one else is running)
terraform force-unlock <LOCK-ID>
Part 8: Terraform Import and Brownfield Adoption
Most organizations aren't starting from scratch. You have existing infrastructure that needs to be brought under Terraform management.
Bulk Import Strategy
# Use import blocks (Terraform 1.5+) for declarative imports
import {
  to = aws_vpc.main
  id = "vpc-0123456789abcdef0"
}

import {
  to = aws_subnet.private["us-east-1a"]
  id = "subnet-0123456789abcdef0"
}

import {
  to = aws_subnet.private["us-east-1b"]
  id = "subnet-0987654321fedcba0"
}
# Then run terraform plan to generate the configuration
terraform plan -generate-config-out=generated.tf
The -generate-config-out flag is a game-changer for brownfield adoption. It reverse-engineers the resource configuration from AWS and writes it as Terraform code. You'll need to clean it up — remove computed attributes, parameterize values, extract into modules — but it's a massive head start over writing everything from scratch.
Migration Workflow
- Inventory existing resources using AWS Config or `aws resourcegroupstaggingapi get-resources`.
- Write import blocks for each resource.
- Generate configuration with `terraform plan -generate-config-out`.
- Clean up generated code — extract variables, remove defaults, organize into files.
- Run `terraform plan` — it should show zero changes if the import and config are correct.
- Add to CI/CD and treat it as managed infrastructure going forward.
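To make the cleanup step concrete, here is the flavor of transformation involved. Both snippets are illustrative rather than real generator output, and they would not coexist in one file:

```hcl
# What -generate-config-out typically emits: literal values everywhere.
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
  tags                 = {}
}

# After cleanup: parameterized and tagged, ready to live in a module.
resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_hostnames = true
  enable_dns_support   = true
  tags                 = merge(local.common_tags, { Name = "${local.name_prefix}-vpc" })
}
```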
Troubleshooting Common Terraform Issues
"Error acquiring the state lock"
# Someone else is running terraform, or a previous run crashed
# First, check who holds the lock
terraform force-unlock <LOCK-ID> # Only if you're sure it's stale
"Provider produced inconsistent result"
This happens when a resource attribute changes outside of Terraform (someone clicked in the console):
# Refresh state to match reality
terraform apply -refresh-only
# Review the changes, then approve
"Cycle detected in resource dependencies"
Break the cycle by using depends_on explicitly or restructuring your resources:
# Instead of circular references between security groups:
resource "aws_security_group" "app" {
  name   = "app-sg"
  vpc_id = var.vpc_id
}

resource "aws_security_group" "db" {
  name   = "db-sg"
  vpc_id = var.vpc_id
}

# Add rules as separate resources to break the cycle
resource "aws_security_group_rule" "app_to_db" {
  type                     = "egress"
  security_group_id        = aws_security_group.app.id
  source_security_group_id = aws_security_group.db.id
  from_port                = 5432
  to_port                  = 5432
  protocol                 = "tcp"
}

resource "aws_security_group_rule" "db_from_app" {
  type                     = "ingress"
  security_group_id        = aws_security_group.db.id
  source_security_group_id = aws_security_group.app.id
  from_port                = 5432
  to_port                  = 5432
  protocol                 = "tcp"
}
"Error: Unsupported attribute" After Provider Upgrade
Pin your providers and upgrade deliberately:
# Record provider hashes for every platform your team uses in .terraform.lock.hcl
terraform providers lock -platform=linux_amd64 -platform=darwin_arm64
# Update one provider at a time
terraform init -upgrade
# Run plan immediately to catch breaking changes
terraform plan
The Golden Rules
- Pin everything. Terraform version, provider versions, module versions. Unpinned versions are time bombs.
- State is sacred. Use remote state, enable locking, enable versioning. Corrupted state is the worst Terraform failure mode.
- Modules enforce standards. Security defaults baked into modules mean every team gets the right configuration by default.
- Plan is mandatory. Never apply without reviewing the plan. Automate the plan, require human review before apply.
- Environments should differ in scale, not structure. If your dev and production infrastructure are architecturally different, you're going to have a bad time.
- Blast radius matters. Small state files, small changes, small blast radius. A change that modifies 50 resources in one apply is a change that can break 50 things at once.
- Import before you recreate. If the resource exists in AWS, import it into state. Don't destroy and recreate — that causes downtime and data loss.
- Use `moved` blocks for refactoring. When reorganizing code into modules, use `moved` blocks instead of state manipulation. They're declarative, reviewable, and reversible.

# When moving a resource into a module
moved {
  from = aws_s3_bucket.logs
  to   = module.logging.aws_s3_bucket.this
}
Infrastructure as code isn't just about automation — it's about building systems that a team can understand, review, and trust. When anyone on the team can read a PR and understand exactly what infrastructure will change, you've achieved the goal. That's what production-grade Terraform looks like.
The journey from a single main.tf file to a fully modularized, tested, CI/CD-driven Terraform setup takes time. Don't try to build the perfect setup on day one. Start with remote state and locking. Then extract your first module. Then add CI with automated plan comments. Each step makes your infrastructure more reliable, more reviewable, and more scalable. A year from now, you'll look back at the investment and wonder how you ever managed infrastructure any other way.