Batch Jobs: Protecting Long-Running Workloads from Disruption

The problem

Karpenter (and EKS Auto Mode) continuously consolidates underutilized nodes. This is great for cost optimization, but catastrophic for long-running batch jobs. A 6-hour ML training run evicted at hour 5 wastes 5 hours of GPU compute. An ETL pipeline disrupted mid-write can leave data in an inconsistent state.

Without protection, consolidation treats your 8-hour training job the same as a stateless web server -- just another pod to reschedule.

Prerequisites

Cluster deployed and kubectl configured per Quick Start.

How `karpenter.sh/do-not-disrupt` works

Adding this annotation to a pod's metadata tells Auto Mode: "do not voluntarily evict this pod for consolidation or drift remediation."

metadata:
  annotations:
    karpenter.sh/do-not-disrupt: "true"

When this annotation is present on any pod running on a node, that entire node becomes protected from voluntary disruption. The node will not be consolidated, drifted, or removed for emptiness as long as the annotated pod is running.

Scope of protection

Disruption Type	Protected?	Example
Consolidation (underutilized)	Yes	Karpenter wants to bin-pack pods onto fewer nodes
Drift remediation	Yes	AMI updated, Karpenter wants to roll nodes
Empty node removal	Yes	All other pods drained, but annotated pod remains
Spot interruption	No	AWS reclaims the instance with 2-min warning
Node health failure	No	EC2 status check fails
Manual `kubectl drain`	No	Human or automation explicitly drains

Key insight: this protects against the scheduler's optimization decisions, not against infrastructure failures. For Spot protection, use on-demand instances. For health failures, implement checkpointing.

Why annotation vs taint

These solve different problems:

Taints control which pods CAN schedule onto a node (admission control)
do-not-disrupt controls whether a node with this pod CAN be consolidated (eviction control)

A GPU taint prevents CPU pods from landing on GPU nodes. do-not-disrupt prevents Karpenter from evicting your training job to consolidate that GPU node.

When to use

ML training jobs (hours to days)
ETL pipelines with expensive restart costs
Video transcoding (long-running, stateful progress)
Database migrations or backfills
Any batch workload where: restart cost > idle node cost

Deploy

Deploy the batch job:

kubectl apply -f batch-training-job.yaml

Verify the job is running:

kubectl get jobs -n batch-jobs
kubectl get pods -n batch-jobs -o wide

What to observe

Confirm the annotation is on the running pod:

kubectl get pod -n batch-jobs -l app=ml-training -o jsonpath='{.items[0].metadata.annotations}'

Identify which node the job landed on:

NODE=$(kubectl get pod -n batch-jobs -l app=ml-training -o jsonpath='{.items[0].spec.nodeName}') && echo "Protected node: $NODE"

Verify the node exists and is not being drained (it will persist as long as the annotated pod runs):

kubectl get node $NODE

Clean up

kubectl delete -f batch-training-job.yaml

The problem​

Prerequisites​

How karpenter.sh/do-not-disrupt works​

Scope of protection​

Why annotation vs taint​

When to use​

Deploy​

What to observe​

Clean up​