Batch Jobs: Protecting Long-Running Workloads from Disruption

The problem

Karpenter (and EKS Auto Mode) continuously consolidates underutilized nodes. This is great for cost optimization, but catastrophic for long-running batch jobs. A 6-hour ML training run evicted at hour 5 wastes 5 hours of GPU compute. An ETL pipeline disrupted mid-write can leave data in an inconsistent state.

Without protection, consolidation treats your 8-hour training job the same as a stateless web server: just another pod to reschedule.

Prerequisites

Cluster deployed and kubectl configured per Quick Start.

How karpenter.sh/do-not-disrupt works

Adding this annotation to a pod's metadata tells Auto Mode: "do not voluntarily evict this pod for consolidation or drift remediation."

metadata:
  annotations:
    karpenter.sh/do-not-disrupt: "true"

When this annotation is present on any pod running on a node, the entire node is protected from voluntary disruption. The node will not be consolidated, replaced for drift, or removed as empty for as long as the annotated pod is running. Once that pod completes or is deleted, the node becomes eligible for disruption again.
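For a Job, the annotation has to be set on the pod template, because it is read from pods; an annotation placed only on the Job's own metadata is not copied to the pods it creates. The manifest below is a minimal sketch of what a file such as batch-training-job.yaml might contain, not the actual file used in this guide. The namespace and the app=ml-training label match the commands later on this page; the Job name, image, and command are hypothetical placeholders.

```yaml
# Illustrative sketch only: name, image, and command are hypothetical.
apiVersion: batch/v1
kind: Job
metadata:
  name: ml-training              # hypothetical Job name
  namespace: batch-jobs
spec:
  backoffLimit: 0                # no automatic retries in this sketch
  template:
    metadata:
      labels:
        app: ml-training
      annotations:
        # Set on the pod template, not on the Job's metadata,
        # so every pod created by the Job carries it.
        karpenter.sh/do-not-disrupt: "true"
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: registry.example.com/ml-training:latest   # placeholder image
          command: ["python", "train.py"]                  # placeholder command
```

The annotation value is quoted because annotation values must be strings; an unquoted true would be parsed as a boolean and rejected.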

Scope of protection

| Disruption Type | Protected? | Example |
|---|---|---|
| Consolidation (underutilized) | Yes | Karpenter wants to bin-pack pods onto fewer nodes |
| Drift remediation | Yes | AMI updated, Karpenter wants to roll nodes |
| Empty node removal | Yes | All other pods drained, but annotated pod remains |
| Spot interruption | No | AWS reclaims the instance with 2-min warning |
| Node health failure | No | EC2 status check fails |
| Manual kubectl drain | No | Human or automation explicitly drains |

Key insight: this protects against the scheduler's optimization decisions, not against infrastructure failures. For Spot protection, use on-demand instances. For health failures, implement checkpointing.
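One way to apply the on-demand advice is to pin the Job's pods to on-demand capacity with a node selector. The fragment below is a sketch that assumes the NodePool serving this workload permits on-demand capacity and that nodes carry the well-known karpenter.sh/capacity-type label; confirm both for your cluster before relying on it.

```yaml
# Fragment of a Job spec: schedule only onto on-demand nodes.
# Assumes nodes are labeled karpenter.sh/capacity-type (verify in your cluster).
spec:
  template:
    metadata:
      annotations:
        karpenter.sh/do-not-disrupt: "true"
    spec:
      nodeSelector:
        karpenter.sh/capacity-type: on-demand
```

With both settings in place, the annotation covers voluntary disruption and the selector avoids Spot reclamation. Health failures still require checkpointing.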

Why annotation vs taint

These solve different problems:

  • Taints control which pods CAN schedule onto a node (admission control)
  • do-not-disrupt controls whether a node with this pod CAN be consolidated (eviction control)

A GPU taint prevents CPU pods from landing on GPU nodes. do-not-disrupt prevents Karpenter from evicting your training job to consolidate that GPU node.
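The two mechanisms are typically combined on the same pod. The fragment below is a sketch that assumes GPU nodes are tainted with the key nvidia.com/gpu and effect NoSchedule; that key is a common convention, not something this guide specifies, so substitute whatever taint your NodePool applies.

```yaml
# Fragment of a Job spec combining both controls.
spec:
  template:
    metadata:
      annotations:
        # Eviction control: keep Karpenter from voluntarily disrupting the node.
        karpenter.sh/do-not-disrupt: "true"
    spec:
      tolerations:
        # Admission control: allow this pod onto tainted GPU nodes.
        - key: nvidia.com/gpu      # assumed taint key
          operator: Exists
          effect: NoSchedule
```

The toleration only permits scheduling onto the tainted nodes; it does nothing to keep the pod there. The annotation is what blocks voluntary eviction once the pod is running.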

When to use

  • ML training jobs (hours to days)
  • ETL pipelines with expensive restart costs
  • Video transcoding (long-running, stateful progress)
  • Database migrations or backfills
  • Any batch workload where: restart cost > idle node cost

Deploy

Deploy the batch job:

kubectl apply -f batch-training-job.yaml

Verify the job is running:

kubectl get jobs -n batch-jobs
kubectl get pods -n batch-jobs -o wide

What to observe

Confirm the annotation is on the running pod:

kubectl get pod -n batch-jobs -l app=ml-training -o jsonpath='{.items[0].metadata.annotations}'

Identify which node the job landed on:

NODE=$(kubectl get pod -n batch-jobs -l app=ml-training -o jsonpath='{.items[0].spec.nodeName}') && echo "Protected node: $NODE"

Verify the node exists and is not being drained (it will persist as long as the annotated pod runs):

kubectl get node "$NODE"

Clean up

kubectl delete -f batch-training-job.yaml