Part 4 of 6 in CI/CD Mastery

GitLab CI Pipeline Optimization: Caching, DAG, and Parallel Jobs

Sarah Chen · 8 min read

The Fast Pipeline First

Here's the optimized .gitlab-ci.yml. Then I'll show you what each piece saves.

stages:
  - build
  - test
  - deploy

variables:
  DOCKER_BUILDKIT: "1"
  PIP_CACHE_DIR: "$CI_PROJECT_DIR/.cache/pip"
  NPM_CONFIG_CACHE: "$CI_PROJECT_DIR/.cache/npm"

.default_cache: &default_cache
  key:
    files:
      - package-lock.json
      - requirements.txt
  paths:
    - .cache/
    - node_modules/
    - .venv/
  policy: pull

build-frontend:
  stage: build
  cache:
    <<: *default_cache
    policy: pull-push
  script:
    - npm ci --prefer-offline
    - npm run build
  artifacts:
    paths:
      - dist/
    expire_in: 1 hour

build-backend:
  stage: build
  cache:
    <<: *default_cache
    policy: pull-push
  script:
    - python -m venv .venv
    - source .venv/bin/activate
    - pip install -r requirements.txt
    - python setup.py build
  artifacts:
    paths:
      - build/
    expire_in: 1 hour

lint:
  stage: test
  needs: ["build-frontend"]
  cache:
    <<: *default_cache
  script:
    - npm run lint
    - npm run type-check

unit-tests:
  stage: test
  needs: ["build-backend"]
  cache:
    <<: *default_cache
  parallel: 4
  script:
    - source .venv/bin/activate
    - pip install -r requirements.txt
    - python -m pytest tests/unit/ --splits 4 --group $CI_NODE_INDEX
  artifacts:
    reports:
      junit: report.xml

integration-tests:
  stage: test
  needs: ["build-backend", "build-frontend"]
  services:
    - postgres:15
    - redis:7
  variables:
    POSTGRES_DB: testdb
    POSTGRES_PASSWORD: testpass
  script:
    - source .venv/bin/activate
    - pip install -r requirements.txt
    - python -m pytest tests/integration/ -x

e2e-tests:
  stage: test
  needs: ["build-frontend"]
  parallel: 3
  script:
    - npm ci --prefer-offline
    - npx playwright install --with-deps chromium
    - npx playwright test --shard=$CI_NODE_INDEX/$CI_NODE_TOTAL

deploy-staging:
  stage: deploy
  needs: ["unit-tests", "integration-tests", "e2e-tests"]
  script:
    - ./deploy.sh staging
  environment:
    name: staging
  rules:
    - if: $CI_COMMIT_BRANCH == "main"

That pipeline runs in under 6 minutes. The unoptimized version took 25. Here's why.

Caching: Stop Re-downloading the Internet

Every pipeline run without caching downloads every dependency from scratch. That's insane.

The key.files directive hashes your lockfiles. Same lockfile, same cache. Change a dependency, cache busts automatically.

cache:
  key:
    files:
      - package-lock.json
  paths:
    - node_modules/
  policy: pull-push

Three cache policies matter:

  • pull-push — Read and write cache. Use on build jobs that populate the cache.
  • pull — Read-only. Use on test jobs that consume the cache. Prevents cache corruption from parallel writes.
  • push — Write-only. Rare. Used for cache warming jobs.

One rule: only one job should pull-push per cache key. Multiple writers cause race conditions. Every other job gets pull.
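The write-only policy in action: a sketch of a scheduled cache-warming job, where the job name and schedule trigger are assumptions, not part of the pipeline above.

```yaml
# Hypothetical cache-warming job: the single writer for this cache key.
# Runs on a schedule so regular pipeline jobs can use policy: pull.
warm-cache:
  stage: build
  cache:
    key:
      files:
        - package-lock.json
    paths:
      - .cache/
      - node_modules/
    policy: push   # write-only: never wastes time downloading the cache first
  script:
    - npm ci --prefer-offline
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"
```

With a job like this as the only writer, every consumer can stay on pull and the race-condition problem disappears entirely.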

DAG: The needs Keyword Kills Idle Time

Default GitLab CI waits for the entire previous stage to finish before starting the next one. The needs keyword breaks that wall.

lint:
  stage: test
  needs: ["build-frontend"]  # starts as soon as build-frontend finishes

integration-tests:
  stage: test
  needs: ["build-backend", "build-frontend"]  # waits for both

Without needs, lint waits for build-backend too. That's wasted time. DAG dependencies let jobs start the instant their actual dependencies complete.

Visualize it. Without DAG:

build-frontend ──┐
                 ├── (wait for both) ── lint, unit-tests, integration-tests, e2e
build-backend  ──┘

With DAG:

build-frontend ── lint (starts immediately)
               ── e2e-tests (starts immediately)
build-backend  ── unit-tests (starts immediately)
both done      ── integration-tests

You just saved 3-5 minutes depending on build times.

Parallel Test Splitting

Tests are the bottleneck. Split them.

unit-tests:
  parallel: 4
  script:
    - python -m pytest tests/unit/ --splits 4 --group $CI_NODE_INDEX

GitLab spawns 4 runners. $CI_NODE_INDEX tells each runner which chunk to run. $CI_NODE_TOTAL gives you the total count.

For Playwright or Cypress, sharding is built in:

e2e-tests:
  parallel: 3
  script:
    - npx playwright test --shard=$CI_NODE_INDEX/$CI_NODE_TOTAL

200 E2E tests across 3 runners. Each runs ~67. Wall time drops by 3x.

Want smarter splitting? Use pytest-split with timing data:

unit-tests:
  parallel: 4
  script:
    - python -m pytest tests/unit/ --splits 4 --group $CI_NODE_INDEX --splitting-algorithm least_duration
  artifacts:
    paths:
      - .test_durations

Given recorded timing data in .test_durations, it distributes slow and fast tests evenly across runners. No more one runner finishing in 30 seconds while another grinds for 4 minutes.
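One gap worth closing: least_duration only helps if .test_durations holds real timings. A sketch of a scheduled job that regenerates the file with pytest-split's --store-durations flag (the job name and schedule are assumptions):

```yaml
# Hypothetical job: refresh test timing data so least_duration
# splitting stays accurate as the suite evolves.
update-test-durations:
  stage: test
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"
  script:
    - source .venv/bin/activate
    - pip install -r requirements.txt
    - python -m pytest tests/unit/ --store-durations
  artifacts:
    paths:
      - .test_durations
```

Commit the refreshed file back to the repo (or publish it where the split jobs can fetch it) so every parallel group reads the same timings.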

Artifacts: Pass Data, Don't Rebuild

Build once. Share everywhere.

build-frontend:
  artifacts:
    paths:
      - dist/
    expire_in: 1 hour

Set expire_in. Always. Default artifact retention fills your storage fast. One hour is enough for pipeline artifacts. Bump to 30 days for release builds.
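The longer-retention case looks like this; a sketch where the release-build job name and the tag rule are assumptions:

```yaml
# Hypothetical release job: tagged builds keep artifacts for 30 days,
# long enough to roll back to any recent release
release-build:
  stage: build
  script:
    - npm ci --prefer-offline
    - npm run build
  artifacts:
    paths:
      - dist/
    expire_in: 30 days
  rules:
    - if: $CI_COMMIT_TAG
```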

Use needs to pull artifacts from specific jobs instead of downloading everything:

deploy-staging:
  needs:
    - job: build-frontend
      artifacts: true
    - job: unit-tests
      artifacts: false  # only need the dependency, not the test reports

Rules Over only/except

Stop using only and except. GitLab's own docs mark rules as the preferred keyword, and only/except are no longer actively developed.

deploy-staging:
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
      when: on_success
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
      when: manual
    - when: never

Rules are evaluated top to bottom; the first match wins. If no rule matches, the job is excluded by default, so a trailing when: never is technically redundant, but it documents the intent explicitly.

Docker-in-Docker vs. Kaniko

Building Docker images inside GitLab CI is a common bottleneck. Docker-in-Docker (DinD) is the default, but Kaniko is faster and more secure.

Docker-in-Docker (Slow, Requires Privileged)

build-image:
  stage: build
  image: docker:24
  services:
    - docker:24-dind
  variables:
    DOCKER_TLS_CERTDIR: "/certs"
  script:
    - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
    - docker build -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA .
    - docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA

DinD requires privileged: true on the runner. That's a security risk. It also starts a Docker daemon for every job — 10-15 seconds of overhead.

Kaniko (Faster, No Privileges)

build-image:
  stage: build
  image:
    name: gcr.io/kaniko-project/executor:v1.22.0-debug
    entrypoint: [""]
  before_script:
    # Kaniko reads registry credentials from its Docker config file
    - mkdir -p /kaniko/.docker
    - echo "{\"auths\":{\"${CI_REGISTRY}\":{\"auth\":\"$(printf "%s:%s" "${CI_REGISTRY_USER}" "${CI_REGISTRY_PASSWORD}" | base64 | tr -d '\n')\"}}}" > /kaniko/.docker/config.json
  script:
    - /kaniko/executor
        --context $CI_PROJECT_DIR
        --dockerfile $CI_PROJECT_DIR/Dockerfile
        --destination $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
        --cache=true
        --cache-repo=$CI_REGISTRY_IMAGE/cache

Kaniko runs in userspace. No Docker daemon. No privileged mode. The --cache=true flag caches layers in your registry, so subsequent builds only rebuild changed layers. This alone saves 2-5 minutes per build.

Runner Configuration for Speed

Your runner setup matters as much as your .gitlab-ci.yml. Here are the optimizations that make the biggest difference.

Use Local Caches on Dedicated Runners

If you're running your own GitLab runners, local caching outperforms S3/GCS caches:

# /etc/gitlab-runner/config.toml
[[runners]]
  [runners.cache]
    Type = "local"
    Path = "/opt/gitlab-runner/cache"
    Shared = true

  [runners.docker]
    pull_policy = ["if-not-present"]
    volumes = ["/opt/gitlab-runner/cache:/cache"]

pull_policy = "if-not-present" skips pulling images that already exist on the runner. For a custom base image used by every job, this saves 15-30 seconds per job.

Autoscaling Runners for Peak Hours

[[runners]]
  [runners.machine]
    IdleCount = 2
    IdleTime = 600
    MaxBuilds = 100
    MachineDriver = "amazonec2"
    MachineName = "gitlab-runner-%s"
    MachineOptions = [
      "amazonec2-instance-type=c5.2xlarge",
      "amazonec2-region=us-east-1",
      "amazonec2-spot-instance=true",
      "amazonec2-spot-price=0.15"
    ]

Spot instances for CI runners save 60-70% over on-demand pricing. CI workloads are interruptible by design; with retries configured for infrastructure failures, a spot termination just means the job re-runs.
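For that to hold, retry on infrastructure failures but not on real test failures; a sketch using GitLab's retry keyword:

```yaml
# Re-run jobs killed by spot reclaims or runner restarts,
# without masking genuine test failures
default:
  retry:
    max: 2
    when:
      - runner_system_failure
      - stuck_or_timeout
```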

Reducing Image Pull Times

Every job starts by pulling a Docker image. For large images, this dominates the pipeline.

variables:
  # Use a lightweight base image
  DEFAULT_IMAGE: alpine:3.19  # 7 MB vs. ubuntu:22.04 at 77 MB

# Pre-build custom images with your dependencies baked in
.python-base:
  image: $CI_REGISTRY_IMAGE/ci-python:3.12
  # This image contains: python 3.12, pip, pytest, ruff
  # Built weekly by a scheduled pipeline

Bake your CI dependencies into a custom image. Instead of running pip install in every job, pull an image that already has everything installed. The one-time cost of maintaining the image pays for itself across hundreds of pipeline runs.

# Scheduled weekly: rebuild CI base images
rebuild-ci-images:
  stage: build
  image: docker:24
  services:
    - docker:24-dind
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"
  script:
    - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
    - docker build -t $CI_REGISTRY_IMAGE/ci-python:3.12 -f ci/Dockerfile.python .
    - docker push $CI_REGISTRY_IMAGE/ci-python:3.12

Interruptible Jobs

When a new commit is pushed to the same branch, previous pipeline runs are wasted work. Mark jobs as interruptible:

default:
  interruptible: true

deploy-staging:
  stage: deploy
  interruptible: false  # Never cancel a deployment mid-way

Auto-cancel superseded pipelines in project settings under CI/CD > General pipelines. This prevents queue buildup during active development.
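On newer GitLab versions (16.1+), the same auto-cancel behavior can also be declared in the pipeline itself; a sketch, assuming your instance supports workflow:auto_cancel:

```yaml
# Cancel superseded pipelines on new commits, honoring each
# job's interruptible flag
workflow:
  auto_cancel:
    on_new_commit: interruptible
```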

The Scorecard

Optimization                     Time Saved
Caching dependencies             ~4 min
DAG with needs                   ~3 min
Parallel test splitting (4x)     ~8 min
Artifact reuse                   ~2 min
Smaller Docker images            ~2 min
Kaniko vs. DinD                  ~2 min
Custom CI base images            ~1 min
Total                            ~22 min

From 25 minutes to 3-6. That's not a nice-to-have. That's the difference between developers waiting for CI and developers shipping code.

Common Pitfalls

Caching node_modules with a static key. A stale node_modules can mask incompatible dependencies. Either cache the package manager's own cache directory (.cache/npm) and reinstall with npm ci --prefer-offline, or key the cache on package-lock.json so it busts automatically when dependencies change, as the pipeline above does.
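A minimal sketch of the cache-the-cache-directory variant (the job name is illustrative):

```yaml
# Cache npm's download cache, not node_modules itself.
# npm ci rebuilds node_modules from the lockfile every run,
# pulling tarballs from the local cache instead of the network.
frontend-job:
  variables:
    NPM_CONFIG_CACHE: "$CI_PROJECT_DIR/.cache/npm"
  cache:
    key:
      files:
        - package-lock.json
    paths:
      - .cache/npm/
  script:
    - npm ci --prefer-offline
```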

Running all tests on every push. Feature branch pushes don't need the full E2E suite. Use rules to run expensive tests only on merge requests and main:

e2e-tests:
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
    - if: $CI_COMMIT_BRANCH == "main"
    - when: never

Not setting artifact expiry. Default retention fills your storage. Artifacts from feature branch pipelines should expire in 1-3 days. Release artifacts in 30 days.

Ignoring pipeline metrics. Without data, you can't prove your optimizations work. Track pipeline duration over time.

One More Thing

Add this to your pipeline to track optimization over time:

pipeline-metrics:
  stage: .post
  script:
    # There is no predefined CI_PIPELINE_DURATION variable; derive the
    # duration from the pipeline's creation timestamp (requires GNU date)
    - DURATION=$(( $(date +%s) - $(date -d "$CI_PIPELINE_CREATED_AT" +%s) ))
    - echo "Pipeline duration ${DURATION}s"
    - 'curl -s -X POST "$METRICS_URL/api/v1/import/prometheus" --data-binary "gitlab_pipeline_duration_seconds{project=\"$CI_PROJECT_NAME\"} $DURATION"'
  when: always

If it's not measured, it drifts. Ship it, track it, keep it fast.

Sarah Chen
CI/CD Engineering Lead

Automation evangelist who believes no deployment should require a human. I write pipelines, break pipelines, and write about both. Code-first, always.