Part 1 of 6 in CI/CD Mastery

The Complete Guide to GitHub Actions CI/CD: From Zero to Production-Ready Pipelines

Sarah Chen · 15 min read

Why This Guide Exists

I've built CI/CD pipelines at every scale — from solo projects to teams shipping 200+ deploys a day. The pattern is always the same: someone writes a workflow that works, then copy-pastes it across repos until it becomes an unmaintainable mess.

This guide is the pipeline playbook I wish I'd had on day one. We're going from a blank repo to a production-grade CI/CD system with testing, security scanning, staged deployments, and rollback capability. No skipping steps.

If you've never written a GitHub Actions workflow, start at the beginning. If you're already comfortable, skip to the advanced patterns. Either way, every code block here has been battle-tested in production.

Part 1: Understanding the Fundamentals

Anatomy of a Workflow

Every GitHub Actions workflow lives in .github/workflows/ and follows this structure:

name: CI Pipeline
on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]
  workflow_dispatch:  # Manual trigger

permissions:
  contents: read
  packages: write

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run tests
        run: echo "Tests go here"

Key concepts:

  • on defines triggers. Every workflow needs at least one.
  • permissions follows least privilege. Never use permissions: write-all.
  • env sets workflow-level environment variables.
  • jobs run in parallel by default. Use needs for dependencies.
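To make the last point concrete, here's a minimal sketch of `needs` turning parallel jobs into a sequence (job names and steps are illustrative):

```yaml
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - run: echo "runs first"
  test:
    runs-on: ubuntu-latest
    needs: lint          # waits for lint to succeed
    steps:
      - run: echo "runs after lint"
  deploy:
    runs-on: ubuntu-latest
    needs: [lint, test]  # waits for both
    steps:
      - run: echo "runs last"
```

If any job in a `needs` chain fails, downstream jobs are skipped by default.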

Event Triggers That Actually Matter

Most teams only need these triggers:

on:
  push:
    branches: [main]
    paths-ignore:
      - '**.md'
      - 'docs/**'
  pull_request:
    types: [opened, synchronize, reopened]
  release:
    types: [published]
  schedule:
    - cron: '0 6 * * 1'  # Monday 6 AM UTC
  workflow_dispatch:
    inputs:
      environment:
        description: 'Target environment'
        required: true
        type: choice
        options: [staging, production]

The paths-ignore filter is crucial. Nobody needs a full CI run because you fixed a typo in the README.

Part 2: Building the Test Pipeline

Step 1: Lint, Type Check, Unit Test

This is your first line of defense. Every PR runs through this.

name: CI
on:
  pull_request:
  push:
    branches: [main]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: 'npm'
      - run: npm ci
      - run: npm run lint
      - run: npm run type-check

  test:
    runs-on: ubuntu-latest
    needs: lint
    strategy:
      fail-fast: false
      matrix:
        node: [20, 22]
        shard: [1, 2, 3]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node }}
          cache: 'npm'
      - run: npm ci
      - run: npm test -- --shard=${{ matrix.shard }}/3
      - uses: actions/upload-artifact@v4
        if: failure()
        with:
          name: test-results-${{ matrix.node }}-${{ matrix.shard }}
          path: coverage/
          retention-days: 7

Notice the test sharding. Three shards running in parallel means your 9-minute test suite finishes in 3 minutes. The fail-fast: false ensures all shards complete so you see every failure, not just the first one.

Step 2: Integration Tests with Services

Real applications need databases. GitHub Actions supports service containers natively.

  integration:
    runs-on: ubuntu-latest
    needs: lint
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_USER: test
          POSTGRES_PASSWORD: test
          POSTGRES_DB: testdb
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
      redis:
        image: redis:7
        ports:
          - 6379:6379
        options: --health-cmd "redis-cli ping" --health-interval 10s
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: 'npm'
      - run: npm ci
      - run: npm run test:integration
        env:
          DATABASE_URL: postgresql://test:test@localhost:5432/testdb
          REDIS_URL: redis://localhost:6379

The options block is important. Without health checks, your tests might start before Postgres is ready and fail with connection errors. I see this mistake constantly.

Part 3: Building and Publishing Container Images

Multi-Platform Builds with Caching

  build:
    runs-on: ubuntu-latest
    needs: [test, integration]
    permissions:
      contents: read
      packages: write
    outputs:
      image-digest: ${{ steps.build.outputs.digest }}
      image-tag: ${{ steps.meta.outputs.version }}
    steps:
      - uses: actions/checkout@v4

      - uses: docker/setup-qemu-action@v3
      - uses: docker/setup-buildx-action@v3

      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - id: meta
        uses: docker/metadata-action@v5
        with:
          images: ghcr.io/${{ github.repository }}
          tags: |
            type=sha,prefix=
            type=ref,event=branch
            type=semver,pattern={{version}}
            type=semver,pattern={{major}}.{{minor}}

      - id: build
        uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          platforms: linux/amd64,linux/arm64
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
          provenance: true
          sbom: true

The cache-from: type=gha uses GitHub's built-in cache. This takes a 12-minute Docker build down to 2 minutes on subsequent runs. The provenance and sbom flags generate supply chain attestations — you'll need these for compliance.

Part 4: Deployment Pipeline with Environments

Staged Deployments with Approval Gates

This is where most teams stop: "push to main = deploy to prod." Don't do that. Use environments.

  deploy-staging:
    runs-on: ubuntu-latest
    needs: build
    if: github.ref == 'refs/heads/main'
    environment:
      name: staging
      url: https://staging.example.com
    steps:
      - uses: actions/checkout@v4
      - name: Configure kubeconfig
        run: |
          # kubectl reads KUBECONFIG, not KUBECONFIG_DATA; this assumes
          # the secret holds a base64-encoded kubeconfig
          echo "${{ secrets.STAGING_KUBECONFIG }}" | base64 -d > "$RUNNER_TEMP/kubeconfig"
          echo "KUBECONFIG=$RUNNER_TEMP/kubeconfig" >> "$GITHUB_ENV"

      - name: Deploy to staging
        run: |
          kubectl set image deployment/app \
            app=ghcr.io/${{ github.repository }}:${{ needs.build.outputs.image-tag }} \
            --namespace staging

      - name: Run smoke tests
        run: |
          for i in {1..30}; do
            STATUS=$(curl -s -o /dev/null -w "%{http_code}" https://staging.example.com/health)
            if [ "$STATUS" = "200" ]; then
              echo "Staging is healthy"
              exit 0
            fi
            echo "Waiting for staging... (attempt $i)"
            sleep 10
          done
          echo "Staging health check failed"
          exit 1

  deploy-production:
    runs-on: ubuntu-latest
    needs: [build, deploy-staging]
    if: github.ref == 'refs/heads/main'
    environment:
      name: production
      url: https://example.com
    concurrency:
      group: production
      cancel-in-progress: false
    steps:
      - uses: actions/checkout@v4
      - name: Configure kubeconfig
        run: |
          # kubectl reads KUBECONFIG, not KUBECONFIG_DATA; this assumes
          # the secret holds a base64-encoded kubeconfig
          echo "${{ secrets.PROD_KUBECONFIG }}" | base64 -d > "$RUNNER_TEMP/kubeconfig"
          echo "KUBECONFIG=$RUNNER_TEMP/kubeconfig" >> "$GITHUB_ENV"

      - name: Deploy to production
        run: |
          kubectl set image deployment/app \
            app=ghcr.io/${{ github.repository }}:${{ needs.build.outputs.image-tag }} \
            --namespace production

      - name: Verify deployment
        run: |
          kubectl rollout status deployment/app \
            --namespace production \
            --timeout=300s

Configure the production environment in GitHub Settings with:

  • Required reviewers: At least one team lead must approve.
  • Wait timer: 5-minute delay so you can cancel if something looks wrong.
  • Branch restrictions: Only main can deploy.
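These settings can also be scripted via the "create or update an environment" REST endpoint, which is handy when you manage many repos. A sketch using the gh CLI; the team ID 123456 is a placeholder, and restricting to protected branches is how "only main" is typically enforced:

```shell
# Configure the production environment: required reviewers,
# a 5-minute wait timer, and protected-branch-only deploys.
# 123456 is a placeholder team ID (look it up via the Teams API).
cat <<'EOF' | gh api repos/{owner}/{repo}/environments/production \
  --method PUT --input -
{
  "wait_timer": 5,
  "reviewers": [{ "type": "Team", "id": 123456 }],
  "deployment_branch_policy": {
    "protected_branches": true,
    "custom_branch_policies": false
  }
}
EOF
```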

The concurrency block prevents two production deployments from running simultaneously. Never set cancel-in-progress: true for production — you don't want to interrupt a running deployment.

Part 5: Reusable Workflows

DRY Pipelines Across Repositories

When you have 15 repos running the same pipeline, copy-paste is a liability. Reusable workflows fix this.

Create a shared workflow in a central repo:

# .github/workflows/node-ci.yml (in your shared-workflows repo)
name: Reusable Node.js CI
on:
  workflow_call:
    inputs:
      node-version:
        type: string
        default: '22'
      working-directory:
        type: string
        default: '.'
      run-integration:
        type: boolean
        default: false
    secrets:
      NPM_TOKEN:
        required: false

jobs:
  ci:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: ${{ inputs.working-directory }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ inputs.node-version }}
          cache: 'npm'
          cache-dependency-path: ${{ inputs.working-directory }}/package-lock.json
      - run: npm ci
      - run: npm run lint
      - run: npm run type-check
      - run: npm test
      - if: inputs.run-integration
        run: npm run test:integration

Consume it from any repo:

# .github/workflows/ci.yml (in any consuming repo)
name: CI
on:
  pull_request:
  push:
    branches: [main]

jobs:
  ci:
    uses: my-org/shared-workflows/.github/workflows/node-ci.yml@v2
    with:
      node-version: '22'
      run-integration: true
    secrets:
      NPM_TOKEN: ${{ secrets.NPM_TOKEN }}

Version your shared workflows with tags (@v2), not branch references. When you push a breaking change, consuming repos don't break until they explicitly upgrade.
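The release flow for the shared repo is two tags: an immutable patch tag and a floating major tag. A sketch, simulated here in a throwaway repo so it's self-contained; tag names and the commit message are examples:

```shell
# Cut a release of a shared-workflows repo: an immutable patch tag
# plus a floating major tag. Simulated in a temp repo for illustration.
set -euo pipefail
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" -c user.email=ci@example.com -c user.name=ci \
  commit -q --allow-empty -m "node-ci: add run-integration input"

# 1. The immutable tag consumers can pin for reproducibility.
git -C "$repo" tag v2.1.0

# 2. Move (or create) the floating major tag that @v2 resolves to.
git -C "$repo" tag -f v2 v2.1.0

git -C "$repo" tag -l
```

In the real repo you'd follow with `git push origin v2.1.0` and `git push -f origin v2`.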

Part 6: Security Hardening

Locking Down Your Pipeline

Every step that runs in your pipeline is code you're executing in a trusted environment. Treat it accordingly.

name: Secure CI
on:
  pull_request:

permissions:
  contents: read

jobs:
  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Pin actions to full SHA, not tags
      - uses: actions/setup-node@1d0ff469b7ec7b3cb9d8673fde0c81c44821de2a  # v4.2.0
        with:
          node-version: 22

      # Dependency review on PRs
      - uses: actions/dependency-review-action@67d4f4e7a7a09e53d4baa05862b1e8b1c0338296  # v4.6.0
        if: github.event_name == 'pull_request'

      # SAST scanning
      - uses: github/codeql-action/init@v3
        with:
          languages: javascript
      - uses: github/codeql-action/analyze@v3

      # Secret scanning (pin this to a release SHA too; @main is mutable)
      - uses: trufflesecurity/trufflehog@main
        with:
          extra_args: --only-verified

Key rules:

  1. Pin actions to commit SHAs, not version tags. Tags can be moved.
  2. Set permissions at the workflow level, not just job level.
  3. Never echo secrets in run steps, even for debugging.
  4. Use GITHUB_TOKEN over PATs wherever possible — it's automatically scoped.
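For rule 1, you don't need to hunt through the UI to find the SHA behind a tag; resolve it against the remote. For an annotated tag, the line ending in `^{}` is the dereferenced commit to pin:

```shell
# List the commit behind actions/checkout's v4 tag.
# Pin the SHA on the ^{} line if one is shown (annotated tag).
git ls-remote https://github.com/actions/checkout refs/tags/v4
```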

Branch Protection Rules

Your pipeline means nothing without branch protection:

# Set via GitHub CLI (nested JSON must go through --input, not --field)
cat <<'EOF' | gh api repos/{owner}/{repo}/branches/main/protection \
  --method PUT --input -
{
  "required_status_checks": { "strict": true, "contexts": ["ci", "security"] },
  "enforce_admins": true,
  "required_pull_request_reviews": {
    "required_approving_review_count": 1,
    "dismiss_stale_reviews": true
  },
  "restrictions": null
}
EOF

Part 7: Monitoring and Optimization

Track Pipeline Performance

Slow pipelines kill developer productivity. Measure and optimize.

  report-metrics:
    runs-on: ubuntu-latest
    if: always()
    needs: [lint, test, integration, build]
    steps:
      - name: Calculate pipeline duration
        run: |
          echo "## Pipeline Summary" >> $GITHUB_STEP_SUMMARY
          echo "| Job | Status |" >> $GITHUB_STEP_SUMMARY
          echo "|---|---|" >> $GITHUB_STEP_SUMMARY
          echo "| Lint | ${{ needs.lint.result }} |" >> $GITHUB_STEP_SUMMARY
          echo "| Test | ${{ needs.test.result }} |" >> $GITHUB_STEP_SUMMARY
          echo "| Integration | ${{ needs.integration.result }} |" >> $GITHUB_STEP_SUMMARY
          echo "| Build | ${{ needs.build.result }} |" >> $GITHUB_STEP_SUMMARY

Caching Strategy

Cache aggressively. Here's a comprehensive caching setup:

      - uses: actions/cache@v4
        with:
          path: |
            ~/.npm
            node_modules
            .next/cache
          key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}
          restore-keys: |
            ${{ runner.os }}-node-

      - uses: actions/cache@v4
        with:
          path: /tmp/.buildx-cache
          key: ${{ runner.os }}-buildx-${{ github.sha }}
          restore-keys: |
            ${{ runner.os }}-buildx-

The Complete Pipeline

Here's everything assembled into one production-ready workflow. Copy it, customize it, ship it.

name: Production CI/CD
on:
  push:
    branches: [main]
  pull_request:
  release:
    types: [published]

permissions:
  contents: read
  packages: write

jobs:
  ci:
    uses: ./.github/workflows/node-ci.yml
    with:
      run-integration: ${{ github.event_name == 'push' }}

  security:
    uses: ./.github/workflows/security-scan.yml
    permissions:
      security-events: write
      contents: read

  build:
    needs: [ci, security]
    if: github.ref == 'refs/heads/main' || github.event_name == 'release'
    uses: ./.github/workflows/docker-build.yml
    permissions:
      packages: write

  deploy-staging:
    needs: build
    if: github.ref == 'refs/heads/main'
    uses: ./.github/workflows/deploy.yml
    with:
      environment: staging
    secrets: inherit

  deploy-production:
    needs: deploy-staging
    if: github.event_name == 'release'
    uses: ./.github/workflows/deploy.yml
    with:
      environment: production
    secrets: inherit

Part 8: Rollback Strategies

Deploying is only half the story. What happens when the deploy is bad?

Automated Rollback on Failed Health Check

  deploy-production:
    runs-on: ubuntu-latest
    needs: [build, deploy-staging]
    environment: production
    steps:
      - uses: actions/checkout@v4

      - name: Save current revision for rollback
        id: current
        run: |
          CURRENT_IMAGE=$(kubectl get deployment/app -n production \
            -o jsonpath='{.spec.template.spec.containers[0].image}')
          echo "image=$CURRENT_IMAGE" >> "$GITHUB_OUTPUT"

      - name: Deploy new version
        run: |
          kubectl set image deployment/app \
            app=ghcr.io/${{ github.repository }}:${{ needs.build.outputs.image-tag }} \
            --namespace production
          kubectl rollout status deployment/app \
            --namespace production --timeout=300s

      - name: Post-deploy verification
        id: verify
        continue-on-error: true
        run: |
          sleep 30
          for i in {1..10}; do
            STATUS=$(curl -s -o /dev/null -w "%{http_code}" https://api.example.com/health)
            if [ "$STATUS" != "200" ]; then
              echo "Health check failed with status $STATUS (attempt $i)"
              exit 1
            fi
            RESPONSE_TIME=$(curl -s -o /dev/null -w "%{time_total}" https://api.example.com/health)
            if (( $(echo "$RESPONSE_TIME > 2.0" | bc -l) )); then
              echo "Response time ${RESPONSE_TIME}s exceeds 2s threshold (attempt $i)"
              exit 1
            fi
            sleep 5  # space the checks out instead of hammering instantly
          done

      - name: Rollback on failure
        if: steps.verify.outcome == 'failure'
        run: |
          echo "Verification failed. Rolling back to ${{ steps.current.outputs.image }}"
          kubectl set image deployment/app \
            app=${{ steps.current.outputs.image }} \
            --namespace production
          kubectl rollout status deployment/app \
            --namespace production --timeout=300s
          exit 1

Part 9: Advanced Patterns

Path-Based Conditional Jobs for Monorepos

In a monorepo, you don't want to build everything when only one service changed:

name: Monorepo CI
on:
  pull_request:

jobs:
  detect-changes:
    runs-on: ubuntu-latest
    outputs:
      api: ${{ steps.filter.outputs.api }}
      web: ${{ steps.filter.outputs.web }}
    steps:
      - uses: actions/checkout@v4
      - uses: dorny/paths-filter@v3
        id: filter
        with:
          filters: |
            api:
              - 'services/api/**'
              - 'packages/shared/**'
            web:
              - 'services/web/**'
              - 'packages/shared/**'

  test-api:
    needs: detect-changes
    if: needs.detect-changes.outputs.api == 'true'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: cd services/api && npm ci && npm test

  test-web:
    needs: detect-changes
    if: needs.detect-changes.outputs.web == 'true'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: cd services/web && npm ci && npm test

The packages/shared/** path in both filters ensures shared library changes trigger tests for all consumers.

Composite Actions for Shared Steps

When multiple workflows need the same setup, composite actions keep things DRY:

# .github/actions/setup-node-project/action.yml
name: Setup Node Project
description: Install Node.js, cache dependencies, install packages
inputs:
  node-version:
    description: Node.js version
    default: '22'
  working-directory:
    description: Working directory
    default: '.'

runs:
  using: composite
  steps:
    - uses: actions/setup-node@v4
      with:
        node-version: ${{ inputs.node-version }}
        cache: npm
        cache-dependency-path: ${{ inputs.working-directory }}/package-lock.json
    - shell: bash
      working-directory: ${{ inputs.working-directory }}
      run: npm ci --prefer-offline
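Consuming the composite action is then one step per workflow. A sketch, assuming the action lives in the same repo (a local action requires checkout first):

```yaml
# .github/workflows/ci.yml (same repo as the action)
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4   # required before using a local action
      - uses: ./.github/actions/setup-node-project
        with:
          node-version: '22'
          working-directory: services/api
      - run: npm test
        working-directory: services/api
```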

Workflow Dispatch for Manual Operations

Build operational runbooks as workflows:

name: Database Migration
on:
  workflow_dispatch:
    inputs:
      environment:
        description: Target environment
        required: true
        type: choice
        options: [staging, production]
      dry-run:
        description: Dry run only
        required: true
        type: boolean
        default: true

jobs:
  migrate:
    runs-on: ubuntu-latest
    environment: ${{ inputs.environment }}
    steps:
      - uses: actions/checkout@v4
      - name: Run migration
        run: |
          if [ "${{ inputs.dry-run }}" = "true" ]; then
            echo "DRY RUN — showing pending migrations"
            npx prisma migrate status
          else
            npx prisma migrate deploy
          fi
        env:
          DATABASE_URL: ${{ secrets.DATABASE_URL }}

Part 10: Self-Hosted Runners

GitHub-hosted runners are convenient but have limitations: 2-core machines, limited disk space, no GPU support, and no way to pre-install custom tooling. For teams that need more, self-hosted runners are the answer.

Setting Up Self-Hosted Runners on Kubernetes

Use the Actions Runner Controller (ARC) to manage ephemeral runners in your cluster:

# Install ARC with Helm
helm install arc \
  --namespace arc-systems \
  --create-namespace \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller

# Create a runner scale set
helm install arc-runner-set \
  --namespace arc-runners \
  --create-namespace \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
  --set githubConfigUrl="https://github.com/myorg" \
  --set githubConfigSecret.github_token="$GITHUB_TOKEN" \
  --set maxRunners=10 \
  --set minRunners=1

Use the runners in your workflow:

jobs:
  build:
    runs-on: arc-runner-set  # Uses your self-hosted runners
    steps:
      - uses: actions/checkout@v4
      - run: echo "Running on self-hosted runner"

When to Use Self-Hosted Runners

| Use Case | GitHub-Hosted | Self-Hosted |
|---|---|---|
| Standard CI/CD | Good | Overkill |
| Large Docker builds | Slow (2 cores) | Fast (custom specs) |
| GPU-accelerated tests | Not available | Required |
| Private network access | Not possible | VPC access |
| Cost at scale (>2000 min/mo) | Expensive | Cheaper |
| Custom tooling pre-installed | Not available | Full control |

Part 11: Troubleshooting Common Issues

Common Failures and Fixes

"Resource not accessible by integration": Your permissions block is missing a required scope.

# Common permission sets by use case
permissions:
  contents: read          # Checkout code
  packages: write         # Push to GHCR
  pull-requests: write    # Comment on PRs
  security-events: write  # Upload SARIF results
  id-token: write         # OIDC for cloud auth

Cache misses on every run: Your cache key is too specific. Use a broader restore-keys pattern:

- uses: actions/cache@v4
  with:
    path: node_modules
    key: ${{ runner.os }}-node-${{ hashFiles('package-lock.json') }}
    restore-keys: |
      ${{ runner.os }}-node-

Docker builds are slow: Enable BuildKit layer caching:

# syntax=docker/dockerfile:1
FROM node:22-alpine AS deps
WORKDIR /app
COPY package*.json ./
RUN --mount=type=cache,target=/root/.npm npm ci --omit=dev

FROM node:22-alpine
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY . .
CMD ["node", "server.js"]

Concurrent deploys cause conflicts: Use concurrency groups to serialize deployments:

concurrency:
  group: deploy-${{ github.ref }}
  cancel-in-progress: false  # Never cancel in-progress deploys

Secrets not available in forks: This is by design. For fork PRs, use pull_request_target with caution, or limit CI to non-secret-dependent steps:

on:
  pull_request:      # Safe: runs in fork context, no secrets
  pull_request_target:  # Dangerous: runs in base context, has secrets
    types: [labeled]     # Only trigger on explicit label

Only use pull_request_target when you've explicitly labeled a PR as safe. Running it on every fork PR is a security vulnerability — the PR author controls the code that runs with your secrets.
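In workflow form, that label-gated pattern looks roughly like this (the label name and secret are examples). Checking out the PR head explicitly is what makes the fork's code run with secrets available, so the label must mean a maintainer has actually read the diff:

```yaml
on:
  pull_request_target:
    types: [labeled]

permissions:
  contents: read

jobs:
  e2e:
    if: github.event.label.name == 'safe-to-test'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          # Check out the PR head, not the base branch
          ref: ${{ github.event.pull_request.head.sha }}
      - run: npm ci && npm run test:e2e
        env:
          E2E_API_KEY: ${{ secrets.E2E_API_KEY }}
```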

What I've Learned Building Hundreds of Pipelines

  1. Start simple, add complexity when it hurts. A 20-line workflow that runs tests is better than a 500-line monolith you can't debug.
  2. Pipeline time is developer time. If your CI takes 15 minutes, multiply that by every PR, every day. Optimize relentlessly.
  3. Treat workflows as code. Review them in PRs. Test them in staging. Version them.
  4. Never deploy what you haven't tested. If your integration tests are flaky, fix them — don't skip them.
  5. Secure your workflows from day one. Pin actions to SHAs, scope permissions minimally, and never store secrets in workflow files.
  6. Make failures actionable. Every failed step should tell the developer what went wrong and how to fix it. Upload relevant artifacts on failure so debugging doesn't require re-running the entire pipeline.

The best pipeline is one your team trusts. Build that trust by making it fast, reliable, and transparent. Every failure should tell the developer exactly what went wrong and how to fix it. That's the standard.

Pipeline Performance Benchmarks

After optimizing hundreds of pipelines, here are the targets I set for teams:

| Metric | Acceptable | Good | Excellent |
|---|---|---|---|
| PR CI time (lint + test) | < 15 min | < 8 min | < 4 min |
| Docker build (cached) | < 5 min | < 3 min | < 1 min |
| Staging deploy (after CI) | < 10 min | < 5 min | < 2 min |
| Total push-to-production | < 45 min | < 20 min | < 10 min |
| CI success rate | > 90% | > 95% | > 99% |

Track these over time. When your CI success rate drops below 95%, it means flaky tests are eroding trust. When PR CI time exceeds 10 minutes, developers start context-switching away from the PR, and review cycles slow down.
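One low-effort way to track them is the gh CLI plus jq. A sketch; the workflow file name is an example, and it assumes gh is authenticated for the repo:

```shell
# Success rate and mean duration over the last 100 completed runs
# of ci.yml in the current repository.
gh run list --workflow ci.yml --status completed --limit 100 \
  --json conclusion,startedAt,updatedAt |
jq -r '
  (map(select(.conclusion == "success")) | length) as $ok
  | (map((.updatedAt | fromdate) - (.startedAt | fromdate))
     | add / length / 60) as $mins
  | "success rate: \($ok * 100 / length | round)%",
    "mean duration: \($mins | round) min"'
```

Drop the output into a dashboard or a weekly report; the trend matters more than any single number.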

The fastest teams I've worked with deploy to production within 15 minutes of a merge to main. That speed comes from investment in test reliability, Docker layer caching, parallel jobs, and automated staging verification. Every minute you shave off the pipeline is a minute returned to every developer on every PR, every day. That compounding effect is why pipeline optimization has one of the highest ROI of any engineering investment.

Build your pipeline incrementally: start with tests that pass, add security scanning, add staged deployment, add automated rollback. Each layer adds confidence. And when something inevitably goes wrong, the pipeline's logs, artifacts, and deployment history give you the forensics to fix it fast.

Sarah Chen

CI/CD Engineering Lead

Automation evangelist who believes no deployment should require a human. I write pipelines, break pipelines, and write about both. Code-first, always.
