The Complete Guide to GitHub Actions CI/CD: From Zero to Production-Ready Pipelines
Why This Guide Exists
I've built CI/CD pipelines at every scale — from solo projects to teams shipping 200+ deploys a day. The pattern is always the same: someone writes a workflow that works, then copy-pastes it across repos until it becomes an unmaintainable mess.
This guide is the pipeline playbook I wish I'd had on day one. We're going from a blank repo to a production-grade CI/CD system with testing, security scanning, staged deployments, and rollback capability. No skipping steps.
If you've never written a GitHub Actions workflow, start at the beginning. If you're already comfortable, skip to the advanced patterns. Either way, every code block here has been battle-tested in production.
Part 1: Understanding the Fundamentals
Anatomy of a Workflow
Every GitHub Actions workflow lives in .github/workflows/ and follows this structure:
name: CI Pipeline
on:
push:
branches: [main, develop]
pull_request:
branches: [main]
workflow_dispatch: # Manual trigger
permissions:
contents: read
packages: write
env:
REGISTRY: ghcr.io
IMAGE_NAME: ${{ github.repository }}
jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Run tests
run: echo "Tests go here"
Key concepts:
- `on` defines triggers. Every workflow needs at least one.
- `permissions` follows least privilege. Never use `permissions: write-all`.
- `env` sets workflow-level environment variables.
- `jobs` run in parallel by default. Use `needs` for dependencies.
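To make the last point concrete, here is a minimal sketch of `needs` turning parallel jobs into a sequence (job names and steps are illustrative):

```yaml
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - run: echo "lint"
  test:
    runs-on: ubuntu-latest
    needs: lint          # waits for lint to succeed
    steps:
      - run: echo "test"
  build:
    runs-on: ubuntu-latest
    needs: [lint, test]  # waits for both
    steps:
      - run: echo "build"
```

Without any `needs`, all three jobs would start simultaneously.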
Event Triggers That Actually Matter
Most teams only need these triggers:
on:
push:
branches: [main]
paths-ignore:
- '**.md'
- 'docs/**'
pull_request:
types: [opened, synchronize, reopened]
release:
types: [published]
schedule:
- cron: '0 6 * * 1' # Monday 6 AM UTC
workflow_dispatch:
inputs:
environment:
description: 'Target environment'
required: true
type: choice
options: [staging, production]
The paths-ignore filter is crucial. Nobody needs a full CI run because you fixed a typo in the README.
Part 2: Building the Test Pipeline
Step 1: Lint, Type Check, Unit Test
This is your first line of defense. Every PR runs through this.
name: CI
on:
pull_request:
push:
branches: [main]
jobs:
lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: 22
cache: 'npm'
- run: npm ci
- run: npm run lint
- run: npm run type-check
test:
runs-on: ubuntu-latest
needs: lint
strategy:
fail-fast: false
matrix:
node: [20, 22]
shard: [1, 2, 3]
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: ${{ matrix.node }}
cache: 'npm'
- run: npm ci
- run: npm test -- --shard=${{ matrix.shard }}/3
- uses: actions/upload-artifact@v4
if: failure()
with:
name: test-results-${{ matrix.node }}-${{ matrix.shard }}
path: coverage/
retention-days: 7
Notice the test sharding: three shards running in parallel turn a 9-minute test suite into a 3-minute one. Setting `fail-fast: false` ensures all shards complete, so you see every failure rather than just the first.
Step 2: Integration Tests with Services
Real applications need databases. GitHub Actions supports service containers natively.
integration:
runs-on: ubuntu-latest
needs: lint
services:
postgres:
image: postgres:16
env:
POSTGRES_USER: test
POSTGRES_PASSWORD: test
POSTGRES_DB: testdb
ports:
- 5432:5432
options: >-
--health-cmd pg_isready
--health-interval 10s
--health-timeout 5s
--health-retries 5
redis:
image: redis:7
ports:
- 6379:6379
options: --health-cmd "redis-cli ping" --health-interval 10s
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: 22
cache: 'npm'
- run: npm ci
- run: npm run test:integration
env:
DATABASE_URL: postgresql://test:test@localhost:5432/testdb
REDIS_URL: redis://localhost:6379
The options block is important. Without health checks, your tests might start before Postgres is ready and fail with connection errors. I see this mistake constantly.
Part 3: Building and Publishing Container Images
Multi-Platform Builds with Caching
build:
runs-on: ubuntu-latest
needs: [test, integration]
permissions:
contents: read
packages: write
outputs:
image-digest: ${{ steps.build.outputs.digest }}
image-tag: ${{ steps.meta.outputs.version }}
steps:
- uses: actions/checkout@v4
- uses: docker/setup-qemu-action@v3
- uses: docker/setup-buildx-action@v3
- uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- id: meta
uses: docker/metadata-action@v5
with:
images: ghcr.io/${{ github.repository }}
tags: |
type=sha,prefix=
type=ref,event=branch
type=semver,pattern={{version}}
type=semver,pattern={{major}}.{{minor}}
- id: build
uses: docker/build-push-action@v6
with:
context: .
push: true
platforms: linux/amd64,linux/arm64
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}
cache-from: type=gha
cache-to: type=gha,mode=max
provenance: true
sbom: true
The cache-from: type=gha uses GitHub's built-in cache. This takes a 12-minute Docker build down to 2 minutes on subsequent runs. The provenance and sbom flags generate supply chain attestations — you'll need these for compliance.
Part 4: Deployment Pipeline with Environments
Staged Deployments with Approval Gates
This is where most teams stop: "push to main = deploy to prod." Don't do that. Use environments.
deploy-staging:
runs-on: ubuntu-latest
needs: build
if: github.ref == 'refs/heads/main'
environment:
name: staging
url: https://staging.example.com
steps:
- uses: actions/checkout@v4
    - name: Deploy to staging
      run: |
        # kubectl doesn't read the secret from an env var on its own;
        # write it to a file and point KUBECONFIG at it
        printf '%s' "$KUBECONFIG_DATA" > "$RUNNER_TEMP/kubeconfig"
        export KUBECONFIG="$RUNNER_TEMP/kubeconfig"
        kubectl set image deployment/app \
          app=ghcr.io/${{ github.repository }}:${{ needs.build.outputs.image-tag }} \
          --namespace staging
      env:
        KUBECONFIG_DATA: ${{ secrets.STAGING_KUBECONFIG }}
- name: Run smoke tests
run: |
for i in {1..30}; do
STATUS=$(curl -s -o /dev/null -w "%{http_code}" https://staging.example.com/health)
if [ "$STATUS" = "200" ]; then
echo "Staging is healthy"
exit 0
fi
echo "Waiting for staging... (attempt $i)"
sleep 10
done
echo "Staging health check failed"
exit 1
deploy-production:
runs-on: ubuntu-latest
needs: [build, deploy-staging]
if: github.ref == 'refs/heads/main'
environment:
name: production
url: https://example.com
concurrency:
group: production
cancel-in-progress: false
steps:
- uses: actions/checkout@v4
    - name: Deploy to production
      run: |
        # Same pattern as staging: materialize the kubeconfig before calling kubectl
        printf '%s' "$KUBECONFIG_DATA" > "$RUNNER_TEMP/kubeconfig"
        export KUBECONFIG="$RUNNER_TEMP/kubeconfig"
        kubectl set image deployment/app \
          app=ghcr.io/${{ github.repository }}:${{ needs.build.outputs.image-tag }} \
          --namespace production
      env:
        KUBECONFIG_DATA: ${{ secrets.PROD_KUBECONFIG }}
- name: Verify deployment
run: |
kubectl rollout status deployment/app \
--namespace production \
--timeout=300s
Configure the production environment in GitHub Settings with:
- Required reviewers: At least one team lead must approve.
- Wait timer: 5-minute delay so you can cancel if something looks wrong.
- Branch restrictions: Only `main` can deploy.
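If you prefer to script this instead of clicking through Settings, the same configuration can be applied through GitHub's REST API. A sketch (the reviewer team ID `123456` is a placeholder you'd look up first, and the token needs repo admin rights):

```shell
# Configure the production environment: 5-minute wait timer,
# one required reviewer team, deploys only from protected branches.
# gh fills in {owner}/{repo} from the current repository.
gh api --method PUT "repos/{owner}/{repo}/environments/production" \
  --input - <<'EOF'
{
  "wait_timer": 5,
  "reviewers": [{ "type": "Team", "id": 123456 }],
  "deployment_branch_policy": {
    "protected_branches": true,
    "custom_branch_policies": false
  }
}
EOF
```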
The concurrency block prevents two production deployments from running simultaneously. Never set cancel-in-progress: true for production — you don't want to interrupt a running deployment.
Part 5: Reusable Workflows
DRY Pipelines Across Repositories
When you have 15 repos running the same pipeline, copy-paste is a liability. Reusable workflows fix this.
Create a shared workflow in a central repo:
# .github/workflows/node-ci.yml (in your shared-workflows repo)
name: Reusable Node.js CI
on:
workflow_call:
inputs:
node-version:
type: string
default: '22'
working-directory:
type: string
default: '.'
run-integration:
type: boolean
default: false
secrets:
NPM_TOKEN:
required: false
jobs:
ci:
runs-on: ubuntu-latest
defaults:
run:
working-directory: ${{ inputs.working-directory }}
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: ${{ inputs.node-version }}
cache: 'npm'
cache-dependency-path: ${{ inputs.working-directory }}/package-lock.json
- run: npm ci
- run: npm run lint
- run: npm run type-check
- run: npm test
- if: inputs.run-integration
run: npm run test:integration
Consume it from any repo:
# .github/workflows/ci.yml (in any consuming repo)
name: CI
on:
pull_request:
push:
branches: [main]
jobs:
ci:
uses: my-org/shared-workflows/.github/workflows/node-ci.yml@v2
with:
node-version: '22'
run-integration: true
secrets:
NPM_TOKEN: ${{ secrets.NPM_TOKEN }}
Version your shared workflows with tags (@v2), not branch references. When you push a breaking change, consuming repos don't break until they explicitly upgrade.
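One way to cut those tags mirrors the convention the official actions use: an immutable release tag plus a floating major tag. This sketch runs in a throwaway repo so it is safe to execute anywhere; in practice you'd run the tag commands inside your shared-workflows repo and push them:

```shell
# Demonstrated in a temporary repo with one empty commit.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.email=ci@example.com -c user.name=ci \
  commit -q --allow-empty -m "release: v2.3.0"

git tag v2.3.0        # immutable release tag
git tag -f v2 v2.3.0  # floating major tag that consumers pin with @v2

git tag -l            # lists both: v2 and v2.3.0
# To publish: git push origin v2.3.0 && git push -f origin v2
```

Consumers on `@v2` pick up `v2.3.0` automatically; anyone pinned to `@v2.3.0` stays exactly where they are.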
Part 6: Security Hardening
Locking Down Your Pipeline
Every step that runs in your pipeline is code you're executing in a trusted environment. Treat it accordingly.
name: Secure CI
on:
pull_request:
permissions:
contents: read
jobs:
security:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
# Pin actions to full SHA, not tags
- uses: actions/setup-node@1d0ff469b7ec7b3cb9d8673fde0c81c44821de2a # v4.2.0
with:
node-version: 22
# Dependency review on PRs
- uses: actions/dependency-review-action@67d4f4e7a7a09e53d4baa05862b1e8b1c0338296 # v4.6.0
if: github.event_name == 'pull_request'
# SAST scanning
- uses: github/codeql-action/init@v3
with:
languages: javascript
- uses: github/codeql-action/analyze@v3
# Secret scanning
- uses: trufflesecurity/trufflehog@main
with:
extra_args: --only-verified
Key rules:
- Pin actions to commit SHAs, not version tags. Tags can be moved.
- Set `permissions` at the workflow level, not just the job level.
- Never echo secrets in run steps, even for debugging.
- Use `GITHUB_TOKEN` over PATs wherever possible — it's automatically scoped.
Branch Protection Rules
Your pipeline means nothing without branch protection:
# Set via GitHub CLI
gh api repos/{owner}/{repo}/branches/main/protection \
--method PUT \
--field required_status_checks='{"strict":true,"contexts":["ci","security"]}' \
--field enforce_admins=true \
--field required_pull_request_reviews='{"required_approving_review_count":1,"dismiss_stale_reviews":true}' \
--field restrictions=null
Part 7: Monitoring and Optimization
Track Pipeline Performance
Slow pipelines kill developer productivity. Measure and optimize.
report-metrics:
runs-on: ubuntu-latest
if: always()
needs: [lint, test, integration, build]
steps:
- name: Calculate pipeline duration
run: |
echo "## Pipeline Summary" >> $GITHUB_STEP_SUMMARY
echo "| Job | Status |" >> $GITHUB_STEP_SUMMARY
echo "|---|---|" >> $GITHUB_STEP_SUMMARY
echo "| Lint | ${{ needs.lint.result }} |" >> $GITHUB_STEP_SUMMARY
echo "| Test | ${{ needs.test.result }} |" >> $GITHUB_STEP_SUMMARY
echo "| Integration | ${{ needs.integration.result }} |" >> $GITHUB_STEP_SUMMARY
echo "| Build | ${{ needs.build.result }} |" >> $GITHUB_STEP_SUMMARY
Caching Strategy
Cache aggressively. Here's a comprehensive caching setup:
- uses: actions/cache@v4
with:
path: |
~/.npm
node_modules
.next/cache
key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}
restore-keys: |
${{ runner.os }}-node-
- uses: actions/cache@v4
with:
path: /tmp/.buildx-cache
key: ${{ runner.os }}-buildx-${{ github.sha }}
restore-keys: |
${{ runner.os }}-buildx-
The Complete Pipeline
Here's everything assembled into one production-ready workflow. Copy it, customize it, ship it.
name: Production CI/CD
on:
push:
branches: [main]
pull_request:
release:
types: [published]
permissions:
contents: read
packages: write
jobs:
ci:
uses: ./.github/workflows/node-ci.yml
with:
run-integration: ${{ github.event_name == 'push' }}
security:
uses: ./.github/workflows/security-scan.yml
permissions:
security-events: write
contents: read
build:
needs: [ci, security]
if: github.ref == 'refs/heads/main' || github.event_name == 'release'
uses: ./.github/workflows/docker-build.yml
permissions:
packages: write
deploy-staging:
needs: build
if: github.ref == 'refs/heads/main'
uses: ./.github/workflows/deploy.yml
with:
environment: staging
secrets: inherit
deploy-production:
needs: deploy-staging
if: github.event_name == 'release'
uses: ./.github/workflows/deploy.yml
with:
environment: production
secrets: inherit
Part 8: Rollback Strategies
Deploying is only half the story. What happens when the deploy is bad?
Automated Rollback on Failed Health Check
deploy-production:
runs-on: ubuntu-latest
needs: [build, deploy-staging]
environment: production
steps:
- uses: actions/checkout@v4
- name: Save current revision for rollback
id: current
run: |
CURRENT_IMAGE=$(kubectl get deployment/app -n production \
-o jsonpath='{.spec.template.spec.containers[0].image}')
echo "image=$CURRENT_IMAGE" >> "$GITHUB_OUTPUT"
- name: Deploy new version
run: |
kubectl set image deployment/app \
app=ghcr.io/${{ github.repository }}:${{ needs.build.outputs.image-tag }} \
--namespace production
kubectl rollout status deployment/app \
--namespace production --timeout=300s
- name: Post-deploy verification
id: verify
continue-on-error: true
run: |
sleep 30
for i in {1..10}; do
STATUS=$(curl -s -o /dev/null -w "%{http_code}" https://api.example.com/health)
if [ "$STATUS" != "200" ]; then
echo "Health check failed with status $STATUS"
exit 1
fi
RESPONSE_TIME=$(curl -s -o /dev/null -w "%{time_total}" https://api.example.com/health)
if (( $(echo "$RESPONSE_TIME > 2.0" | bc -l) )); then
echo "Response time ${RESPONSE_TIME}s exceeds 2s threshold"
exit 1
fi
done
- name: Rollback on failure
if: steps.verify.outcome == 'failure'
run: |
echo "Verification failed. Rolling back to ${{ steps.current.outputs.image }}"
kubectl set image deployment/app \
app=${{ steps.current.outputs.image }} \
--namespace production
kubectl rollout status deployment/app \
--namespace production --timeout=300s
exit 1
Part 9: Advanced Patterns
Path-Based Conditional Jobs for Monorepos
In a monorepo, you don't want to build everything when only one service changed:
name: Monorepo CI
on:
pull_request:
jobs:
detect-changes:
runs-on: ubuntu-latest
outputs:
api: ${{ steps.filter.outputs.api }}
web: ${{ steps.filter.outputs.web }}
steps:
- uses: actions/checkout@v4
- uses: dorny/paths-filter@v3
id: filter
with:
filters: |
api:
- 'services/api/**'
- 'packages/shared/**'
web:
- 'services/web/**'
- 'packages/shared/**'
test-api:
needs: detect-changes
if: needs.detect-changes.outputs.api == 'true'
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: cd services/api && npm ci && npm test
test-web:
needs: detect-changes
if: needs.detect-changes.outputs.web == 'true'
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: cd services/web && npm ci && npm test
The packages/shared/** path in both filters ensures shared library changes trigger tests for all consumers.
Composite Actions for Shared Steps
When multiple workflows need the same setup, composite actions keep things DRY:
# .github/actions/setup-node-project/action.yml
name: Setup Node Project
description: Install Node.js, cache dependencies, install packages
inputs:
node-version:
description: Node.js version
default: '22'
working-directory:
description: Working directory
default: '.'
runs:
using: composite
steps:
- uses: actions/setup-node@v4
with:
node-version: ${{ inputs.node-version }}
cache: npm
cache-dependency-path: ${{ inputs.working-directory }}/package-lock.json
- shell: bash
working-directory: ${{ inputs.working-directory }}
run: npm ci --prefer-offline
Workflow Dispatch for Manual Operations
Build operational runbooks as workflows:
name: Database Migration
on:
workflow_dispatch:
inputs:
environment:
description: Target environment
required: true
type: choice
options: [staging, production]
dry-run:
description: Dry run only
required: true
type: boolean
default: true
jobs:
migrate:
runs-on: ubuntu-latest
environment: ${{ inputs.environment }}
steps:
- uses: actions/checkout@v4
- name: Run migration
run: |
if [ "${{ inputs.dry-run }}" = "true" ]; then
echo "DRY RUN — showing pending migrations"
npx prisma migrate status
else
npx prisma migrate deploy
fi
env:
DATABASE_URL: ${{ secrets.DATABASE_URL }}
Part 10: Self-Hosted Runners
GitHub-hosted runners are convenient but have limitations: 2-core machines, limited disk space, no GPU support, and no custom tooling pre-installed (anything you need gets reinstalled on every run). For teams that need more, self-hosted runners are the answer.
Setting Up Self-Hosted Runners on Kubernetes
Use the Actions Runner Controller (ARC) to manage ephemeral runners in your cluster:
# Install ARC with Helm
helm install arc \
--namespace arc-systems \
--create-namespace \
oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller
# Create a runner scale set
helm install arc-runner-set \
--namespace arc-runners \
--create-namespace \
oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
--set githubConfigUrl="https://github.com/myorg" \
--set githubConfigSecret.github_token="$GITHUB_TOKEN" \
--set maxRunners=10 \
--set minRunners=1
Use the runners in your workflow:
jobs:
build:
runs-on: arc-runner-set # Uses your self-hosted runners
steps:
- uses: actions/checkout@v4
- run: echo "Running on self-hosted runner"
When to Use Self-Hosted Runners
| Use Case | GitHub-Hosted | Self-Hosted |
|---|---|---|
| Standard CI/CD | Good | Overkill |
| Large Docker builds | Slow (2 cores) | Fast (custom specs) |
| GPU-accelerated tests | Not available | Required |
| Private network access | Not possible | VPC access |
| Cost at scale (>2000 min/mo) | Expensive | Cheaper |
| Custom tooling pre-installed | Not available | Full control |
Part 11: Troubleshooting Common Issues
Common Failures and Fixes
"Resource not accessible by integration": Your permissions block is missing a required scope.
# Common permission sets by use case
permissions:
contents: read # Checkout code
packages: write # Push to GHCR
pull-requests: write # Comment on PRs
security-events: write # Upload SARIF results
id-token: write # OIDC for cloud auth
Cache misses on every run: Your cache key is too specific. Use a broader restore-keys pattern:
- uses: actions/cache@v4
with:
path: node_modules
key: ${{ runner.os }}-node-${{ hashFiles('package-lock.json') }}
restore-keys: |
${{ runner.os }}-node-
Docker builds are slow: Enable BuildKit layer caching:
# syntax=docker/dockerfile:1
FROM node:22-alpine AS deps
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
FROM node:22-alpine
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY . .
CMD ["node", "server.js"]
Concurrent deploys cause conflicts: Use concurrency groups to serialize deployments:
concurrency:
group: deploy-${{ github.ref }}
cancel-in-progress: false # Never cancel in-progress deploys
Secrets not available in forks: This is by design. For fork PRs, use pull_request_target with caution, or limit CI to non-secret-dependent steps:
on:
pull_request: # Safe: runs in fork context, no secrets
pull_request_target: # Dangerous: runs in base context, has secrets
types: [labeled] # Only trigger on explicit label
Only use pull_request_target when you've explicitly labeled a PR as safe. Running it on every fork PR is a security vulnerability — the PR author controls the code that runs with your secrets.
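A guarded version of that pattern might look like this sketch (the label name and test command are illustrative):

```yaml
on:
  pull_request_target:
    types: [labeled]

jobs:
  ci:
    # Only run once a maintainer has applied the safe-to-test label
    if: github.event.label.name == 'safe-to-test'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          # Check out the PR head explicitly; the default checkout here
          # would give you the base branch, not the PR's code
          ref: ${{ github.event.pull_request.head.sha }}
      - run: npm ci && npm test
```

Remember that a maintainer should re-review the diff before labeling, since the labeled code runs with access to your secrets.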
What I've Learned Building Hundreds of Pipelines
- Start simple, add complexity when it hurts. A 20-line workflow that runs tests is better than a 500-line monolith you can't debug.
- Pipeline time is developer time. If your CI takes 15 minutes, multiply that by every PR, every day. Optimize relentlessly.
- Treat workflows as code. Review them in PRs. Test them in staging. Version them.
- Never deploy what you haven't tested. If your integration tests are flaky, fix them — don't skip them.
- Secure your workflows from day one. Pin actions to SHAs, scope permissions minimally, and never store secrets in workflow files.
- Make failures actionable. Every failed step should tell the developer what went wrong and how to fix it. Upload relevant artifacts on failure so debugging doesn't require re-running the entire pipeline.
The best pipeline is one your team trusts. Build that trust by making it fast, reliable, and transparent. That's the standard.
Pipeline Performance Benchmarks
After optimizing hundreds of pipelines, here are the targets I set for teams:
| Metric | Acceptable | Good | Excellent |
|---|---|---|---|
| PR CI time (lint + test) | < 15 min | < 8 min | < 4 min |
| Docker build (cached) | < 5 min | < 3 min | < 1 min |
| Staging deploy (after CI) | < 10 min | < 5 min | < 2 min |
| Total push-to-production | < 45 min | < 20 min | < 10 min |
| CI success rate | > 90% | > 95% | > 99% |
Track these over time. When your CI success rate drops below 95%, it means flaky tests are eroding trust. When PR CI time exceeds 10 minutes, developers start context-switching away from the PR, and review cycles slow down.
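As a quick way to put a number on the success rate, here is a small shell sketch. In practice the counts would come from something like `gh run list --workflow ci.yml --json conclusion` (left as a comment, since it needs API access); the calculation itself is plain integer arithmetic:

```shell
# Compute a CI success rate (one decimal place) from run counts.
# The counts would typically come from, e.g.:
#   gh run list --workflow ci.yml --limit 200 --json conclusion
success_rate() {
  ok=$1
  total=$2
  scaled=$(( ok * 1000 / total ))   # rate * 10, avoiding floating point
  echo "$(( scaled / 10 )).$(( scaled % 10 ))%"
}

success_rate 188 200   # -> 94.0%
success_rate 197 200   # -> 98.5%
```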
The fastest teams I've worked with deploy to production within 15 minutes of a merge to main. That speed comes from investment in test reliability, Docker layer caching, parallel jobs, and automated staging verification. Every minute you shave off the pipeline is a minute returned to every developer on every PR, every day. That compounding effect is why pipeline optimization has one of the highest ROIs of any engineering investment.
Build your pipeline incrementally: start with tests that pass, add security scanning, add staged deployment, add automated rollback. Each layer adds confidence. And when something inevitably goes wrong, the pipeline's logs, artifacts, and deployment history give you the forensics to fix it fast.
CI/CD Engineering Lead
Automation evangelist who believes no deployment should require a human. I write pipelines, break pipelines, and write about both. Code-first, always.
Related Articles
GitHub Actions Matrix Builds for Multi-Platform Testing
Master GitHub Actions matrix builds to test across multiple OS versions, language versions, and configurations in parallel.
GitHub Actions Reusable Workflows and Composite Actions for DRY Pipelines
Eliminate duplicated CI/CD logic across repositories using GitHub Actions reusable workflows and composite actions with real-world examples.
Hardening GitHub Actions: Permissions, OIDC, and Pinned Actions
Harden GitHub Actions security with least-privilege permissions, OIDC federation, SHA-pinned actions, and secrets management best practices.