Production-Ready Helm Charts: Templates, Values, Hooks, and Testing
Most Helm Charts Are Not Production-Ready
Here's the thing about Helm charts in the wild — the vast majority of them work on a developer's laptop and crumble in production. I've inherited charts that hardcoded replica counts, had no resource limits, used latest as the default image tag, and exposed secrets in plaintext through values files.
A production-ready Helm chart is one that another engineer can deploy to a live cluster with confidence, customize for their environment without forking the chart, and upgrade without downtime. That bar is higher than most people realize.
Let me tell you why these patterns matter, and walk through the practices I enforce on every chart that touches production.
Chart Structure That Scales
Start with a clean layout. Every chart I build follows this structure:
```
my-app/
├── Chart.yaml
├── Chart.lock
├── values.yaml
├── values-production.yaml
├── values-staging.yaml
├── templates/
│   ├── _helpers.tpl
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── ingress.yaml
│   ├── hpa.yaml
│   ├── pdb.yaml
│   ├── serviceaccount.yaml
│   ├── configmap.yaml
│   ├── secret.yaml
│   ├── networkpolicy.yaml
│   └── tests/
│       └── test-connection.yaml
├── charts/                  # subcharts
└── ci/
    └── ci-values.yaml       # values used in CI testing
```
The `ci/` directory is something most people skip. It holds a values file specifically for automated testing in your pipeline. More on that later.
Values Design: The API of Your Chart
Your `values.yaml` is an API contract. Treat it like one. Here's how I structure values for a typical web service:
```yaml
# values.yaml

# -- Number of replicas. Override per environment.
replicaCount: 2

image:
  # -- Container image repository
  repository: ghcr.io/myorg/my-app
  # -- Image pull policy
  pullPolicy: IfNotPresent
  # -- Image tag. Defaults to chart appVersion.
  tag: ""

# -- Image pull secrets for private registries
imagePullSecrets: []

serviceAccount:
  # -- Create a service account
  create: true
  # -- Annotations for the service account (e.g., IRSA)
  annotations: {}
  # -- Service account name. Auto-generated if not set.
  name: ""

service:
  type: ClusterIP
  port: 80
  targetPort: 8080

ingress:
  enabled: false
  className: nginx
  annotations: {}
  hosts:
    - host: my-app.example.com
      paths:
        - path: /
          pathType: Prefix
  tls: []

resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 512Mi

autoscaling:
  enabled: false
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70

podDisruptionBudget:
  enabled: true
  minAvailable: 1

# -- Extra environment variables as key-value pairs
env: {}

# -- Extra environment variables from secrets/configmaps
envFrom: []

# -- Readiness probe configuration
readinessProbe:
  httpGet:
    path: /health
    port: http
  initialDelaySeconds: 5
  periodSeconds: 10

# -- Liveness probe configuration
livenessProbe:
  httpGet:
    path: /health
    port: http
  initialDelaySeconds: 15
  periodSeconds: 20

# -- Node selector constraints
nodeSelector: {}

# -- Tolerations for pod scheduling
tolerations: []

# -- Affinity rules for pod scheduling
affinity: {}
```
Every field carries a comment prefixed with `--`, and that's deliberate: the double-dash is the convention `helm-docs` picks up to auto-generate documentation. If you're not generating docs from your values file, you're asking every consumer of your chart to read your templates to understand what's configurable.
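You can also enforce the contract, not just document it. Helm validates a `values.schema.json` placed next to `values.yaml` (JSON Schema, draft-07) on install, upgrade, lint, and template. A minimal sketch covering a couple of the fields above:

```json
{
  "$schema": "https://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "replicaCount": {
      "type": "integer",
      "minimum": 1
    },
    "image": {
      "type": "object",
      "properties": {
        "repository": { "type": "string" },
        "pullPolicy": { "enum": ["Always", "IfNotPresent", "Never"] },
        "tag": { "type": "string" }
      },
      "required": ["repository"]
    }
  },
  "required": ["replicaCount", "image"]
}
```

With this in place, `helm install` fails fast on a typo'd `replicaCount: "two"` instead of producing a broken Deployment.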
Template Helpers That Prevent Disasters
Your `templates/_helpers.tpl` should define reusable named templates. Here's the foundation I use:
```
{{/* templates/_helpers.tpl */}}

{{/*
Expand the name of the chart.
*/}}
{{- define "my-app.name" -}}
{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" }}
{{- end }}

{{/*
Create a fully qualified app name.
We truncate at 63 characters because Kubernetes name fields are limited.
*/}}
{{- define "my-app.fullname" -}}
{{- if .Values.fullnameOverride }}
{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- $name := default .Chart.Name .Values.nameOverride }}
{{- if contains $name .Release.Name }}
{{- .Release.Name | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" }}
{{- end }}
{{- end }}
{{- end }}

{{/*
Common labels applied to every resource.
*/}}
{{- define "my-app.labels" -}}
helm.sh/chart: {{ include "my-app.chart" . }}
{{ include "my-app.selectorLabels" . }}
app.kubernetes.io/version: {{ .Chart.AppVersion | quote }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
{{- end }}

{{/*
Selector labels — used in deployments and services.
*/}}
{{- define "my-app.selectorLabels" -}}
app.kubernetes.io/name: {{ include "my-app.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
{{- end }}

{{/*
Chart name and version for the chart label.
*/}}
{{- define "my-app.chart" -}}
{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" }}
{{- end }}

{{/*
Service account name.
*/}}
{{- define "my-app.serviceAccountName" -}}
{{- if .Values.serviceAccount.create }}
{{- default (include "my-app.fullname" .) .Values.serviceAccount.name }}
{{- else }}
{{- default "default" .Values.serviceAccount.name }}
{{- end }}
{{- end }}
```
The 63-character truncation is not optional. Kubernetes rejects names longer than 63 characters, and when your release name is `staging-my-long-application-name`, that limit comes fast. I've watched deployments fail in CI because nobody tested with long release names.
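You can see the clipping behavior without a cluster. This plain-shell stand-in for Helm's `trunc 63 | trimSuffix "-"` pipeline (release and chart names invented for illustration) shows what the cut does to a realistic fullname:

```shell
# Stand-in for Helm's `trunc 63 | trimSuffix "-"` pipeline, approximated
# with coreutils, to show how a long release/chart combination is clipped.
release="staging-my-long-application-name"
chart="my-app"
full="${release}-${chart}-some-extra-suffix-that-pushes-it-over-the-limit"

# Keep the first 63 characters, then strip a trailing hyphen if one remains.
clipped=$(printf '%s' "$full" | cut -c1-63 | sed 's/-$//')
printf '%s (%s chars)\n' "$clipped" "${#clipped}"
```

Everything past character 63 is simply gone, which is why two long release names can silently collide unless `fullname` stays unique within the first 63 characters.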
The Deployment Template Done Right
Here's a deployment template with the patterns I consider mandatory:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "my-app.fullname" . }}
  labels:
    {{- include "my-app.labels" . | nindent 4 }}
spec:
  {{- if not .Values.autoscaling.enabled }}
  replicas: {{ .Values.replicaCount }}
  {{- end }}
  selector:
    matchLabels:
      {{- include "my-app.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      annotations:
        checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}
      labels:
        {{- include "my-app.labels" . | nindent 8 }}
    spec:
      {{- with .Values.imagePullSecrets }}
      imagePullSecrets:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      serviceAccountName: {{ include "my-app.serviceAccountName" . }}
      securityContext:
        runAsNonRoot: true
        fsGroup: 65534
      containers:
        - name: {{ .Chart.Name }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop:
                - ALL
          ports:
            - name: http
              containerPort: {{ .Values.service.targetPort }}
              protocol: TCP
          {{- with .Values.readinessProbe }}
          readinessProbe:
            {{- toYaml . | nindent 12 }}
          {{- end }}
          {{- with .Values.livenessProbe }}
          livenessProbe:
            {{- toYaml . | nindent 12 }}
          {{- end }}
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
          {{- if .Values.env }}
          env:
            {{- range $key, $value := .Values.env }}
            - name: {{ $key }}
              value: {{ $value | quote }}
            {{- end }}
          {{- end }}
          {{- with .Values.envFrom }}
          envFrom:
            {{- toYaml . | nindent 12 }}
          {{- end }}
      {{- with .Values.nodeSelector }}
      nodeSelector:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      {{- with .Values.tolerations }}
      tolerations:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      {{- with .Values.affinity }}
      affinity:
        {{- toYaml . | nindent 8 }}
      {{- end }}
```
Here's the thing about that `checksum/config` annotation: it forces a rolling restart when your ConfigMap changes. Without it, you update a config value, Helm reports success, and your pods keep running with the old config because the Deployment spec itself didn't change. I've seen this cause hours of confusion.
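The mechanism is nothing more than hashing the rendered ConfigMap. This plain-shell sketch (the `LOG_LEVEL` content is invented) mimics Helm's `sha256sum` template function to show why any config edit changes the pod template and triggers a rollout:

```shell
# Any change to the rendered ConfigMap body produces a different sha256,
# which changes the pod-template annotation, which makes the Deployment roll.
old=$(printf 'LOG_LEVEL: info\n'  | sha256sum | cut -d' ' -f1)
new=$(printf 'LOG_LEVEL: debug\n' | sha256sum | cut -d' ' -f1)
printf 'old checksum: %s\nnew checksum: %s\n' "$old" "$new"
```

Same idea, different hash: the kubelet never compares configs, it only sees that the pod template changed.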
Also note the security context: `runAsNonRoot`, `readOnlyRootFilesystem`, and dropping all capabilities. These should be defaults, not opt-in.
Helm Hooks for Lifecycle Management
Hooks let you run actions at specific points in the release lifecycle. Here's a database migration hook that runs before upgrades:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "my-app.fullname" . }}-migrate
  labels:
    {{- include "my-app.labels" . | nindent 4 }}
  annotations:
    "helm.sh/hook": pre-upgrade,pre-install
    "helm.sh/hook-weight": "-1"
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  backoffLimit: 3
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
          command: ["./migrate", "--direction", "up"]
          envFrom:
            - secretRef:
                name: {{ include "my-app.fullname" . }}-db-credentials
```
Let me tell you why `hook-delete-policy` is critical. Without `before-hook-creation`, if a previous migration Job still exists (maybe it failed), the new hook can't create a Job with the same name and the entire upgrade hangs. I've been paged for exactly this scenario.
The `hook-weight` annotation controls ordering when you have multiple hooks. Lower numbers run first. Use negative weights for migrations that must complete before other setup hooks.
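For instance, a hypothetical seed-data Job that must run after the migration would carry a higher weight (the name and weight here are illustrative, not from the chart above):

```yaml
metadata:
  name: {{ include "my-app.fullname" . }}-seed   # hypothetical second hook
  annotations:
    "helm.sh/hook": pre-upgrade,pre-install
    "helm.sh/hook-weight": "0"                   # runs after the migration's "-1"
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
```

Hooks with equal weights run in no guaranteed relative order, so give every hook that has ordering requirements an explicit, distinct weight.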
Testing Your Charts
Helm has a built-in test framework that almost nobody uses. Add test pods in `templates/tests/`:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: "{{ include "my-app.fullname" . }}-test-connection"
  labels:
    {{- include "my-app.labels" . | nindent 4 }}
  annotations:
    "helm.sh/hook": test
spec:
  restartPolicy: Never
  containers:
    - name: wget
      image: busybox
      command: ['wget']
      args: ['{{ include "my-app.fullname" . }}:{{ .Values.service.port }}/health']
```
Run tests after install:
```shell
helm test my-release -n production
```
But in-cluster tests are only one layer. For CI, I also run:
```shell
# Lint the chart
helm lint ./my-app --values ./my-app/ci/ci-values.yaml

# Template rendering — catches syntax errors without a cluster
helm template test-release ./my-app --values ./my-app/ci/ci-values.yaml > rendered.yaml

# Validate rendered manifests against Kubernetes schemas
kubeconform -strict -kubernetes-version 1.29.0 rendered.yaml

# Policy checks with conftest
conftest test rendered.yaml --policy ./policies/
```
This pipeline catches the majority of issues before anything touches a cluster. The `ci-values.yaml` file should enable every feature toggle so your templates get fully rendered and tested.
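A sketch of what that `ci/ci-values.yaml` might look like for the chart above, with values chosen purely to exercise the conditional templates:

```yaml
# ci/ci-values.yaml: flip every toggle on so CI renders all templates
ingress:
  enabled: true
  hosts:
    - host: ci.example.com
      paths:
        - path: /
          pathType: Prefix
autoscaling:
  enabled: true        # renders hpa.yaml and drops the static replicas field
podDisruptionBudget:
  enabled: true        # renders pdb.yaml
env:
  LOG_LEVEL: debug     # exercises the env range loop
envFrom:
  - secretRef:
      name: ci-dummy-secret
```

If a template only renders when a flag is on and your CI never turns that flag on, the first person to find the syntax error is a user in production.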
Patterns to Avoid
After maintaining charts across many teams, these are the anti-patterns I push back on:
- **Defaulting the image tag to `latest`.** Use `.Chart.AppVersion` as the default. Pinned versions are non-negotiable for reproducible deployments.
- **Putting secrets in `values.yaml`.** Secrets belong in external secret managers (Vault, AWS Secrets Manager) referenced via `envFrom` or external-secrets-operator. Never check credentials into a chart.
- **Massive monolithic templates.** If a template file exceeds 150 lines, split it. Use named templates in `_helpers.tpl` for repeated blocks.
- **No resource requests or limits.** A chart without resource definitions will get scheduled on nodes that can't handle it, or worse, consume unbounded resources and starve other workloads.
- **Skipping PodDisruptionBudgets.** If you care about availability during node drains and cluster upgrades, a PDB is mandatory. Default to `minAvailable: 1` for any multi-replica workload.
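The chart layout above lists `templates/pdb.yaml` without showing it. A minimal sketch wired to the `podDisruptionBudget` values defined earlier:

```yaml
{{- if .Values.podDisruptionBudget.enabled }}
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: {{ include "my-app.fullname" . }}
  labels:
    {{- include "my-app.labels" . | nindent 4 }}
spec:
  minAvailable: {{ .Values.podDisruptionBudget.minAvailable }}
  selector:
    matchLabels:
      {{- include "my-app.selectorLabels" . | nindent 6 }}
{{- end }}
```

Keep `minAvailable` below `replicaCount`, or voluntary evictions (and therefore node drains) will block entirely.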
Final Thoughts
A Helm chart is the interface between your application and the cluster. It encodes your operational knowledge: how the app should be deployed, what resources it needs, how it scales, and what happens during upgrades.
Treat your charts with the same rigor as application code. Review them in PRs, test them in CI, version them properly. The chart that works on your laptop and the chart that survives a production node failure at 3 AM are very different things. Build for the 3 AM scenario, and the laptop scenario takes care of itself.