Source

This page is generated from skills/eks-operation-review/references/deployment-practices.md. Edit the source, not this page.

Vendored skill

This skill is sourced from eks-operation-review, also maintained by the APEX team.

Deployment Practices

Purpose

Assess deployment strategies, CI/CD integration, and graceful shutdown configuration.

Automation Note

CI/CD pipeline details (approval gates, post-deployment tests) are not fully detectable from cluster state. The skill checks for tool presence and configuration; process maturity items are marked UNKNOWN.

Checks to Execute

8.1 — Deployment Strategy & Rollback

What to check:

Deployment strategies in use (RollingUpdate vs Recreate)
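maxUnavailable and maxSurge settings (the 25% defaults behave unevenly for replicas <= 4: Kubernetes rounds maxUnavailable down and maxSurge up, so 25% of 2 resolves to maxUnavailable 0 and maxSurge 1, while 25% of 4 resolves to maxUnavailable 1, a quarter of capacity. Recommend explicit maxUnavailable: 0, maxSurge: 1 for small deployments)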
Argo Rollouts resources
Flagger Canary resources
terminationGracePeriodSeconds and preStop hooks
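The percentage arithmetic behind the small-replica concern can be sketched as follows. This is a minimal illustration, assuming the documented Kubernetes behaviour that a percentage maxUnavailable is rounded down and a percentage maxSurge is rounded up. The function name `resolve_rolling_update` is hypothetical and not part of the skill.

```python
def resolve_rolling_update(replicas, max_unavailable="25%", max_surge="25%"):
    """Resolve maxUnavailable / maxSurge to absolute pod counts.

    Assumption: percentages follow the Kubernetes rule (maxUnavailable
    rounds down, maxSurge rounds up); integer values are used as given.
    """
    def resolve(value, round_up):
        if isinstance(value, str) and value.endswith("%"):
            scaled = replicas * int(value[:-1])
            # Integer arithmetic avoids floating-point rounding surprises.
            return -(-scaled // 100) if round_up else scaled // 100
        return int(value)

    return resolve(max_unavailable, False), resolve(max_surge, True)


for n in (1, 2, 3, 4, 8):
    unavailable, surge = resolve_rolling_update(n)
    print(f"replicas={n}: maxUnavailable={unavailable}, maxSurge={surge}")
```

Setting the values as integers (for example `resolve_rolling_update(3, 0, 1)`) removes the dependence on rounding, which is the point of recommending explicit settings for small deployments.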

How to check:

List Deployments → inspect spec.strategy.type, rollingUpdate.maxUnavailable, rollingUpdate.maxSurge
Flag deployments with replicas <= 4 and default maxUnavailable: 25%
List Rollouts (Argo Rollouts CRD, if exists)
List Canaries (Flagger CRD, if exists)
Inspect Deployments for terminationGracePeriodSeconds and lifecycle.preStop
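The Deployment inspection steps above could be automated roughly like this. It is a sketch, not the skill's actual implementation: it assumes Deployments have already been fetched as JSON (for example with `kubectl get deployments -A -o json`), and the function name, finding strings, and sample data are all hypothetical.

```python
import json


def review_deployment(deployment):
    """Return a list of findings for one Deployment object (parsed JSON)."""
    spec = deployment.get("spec", {})
    strategy = spec.get("strategy") or {}
    strategy_type = strategy.get("type", "RollingUpdate")  # API default
    rolling = strategy.get("rollingUpdate") or {}
    replicas = spec.get("replicas", 1)  # API default
    pod_spec = spec.get("template", {}).get("spec", {})
    containers = pod_spec.get("containers", [])

    findings = []
    if strategy_type == "Recreate":
        findings.append("Recreate strategy")
    elif replicas <= 4 and rolling.get("maxUnavailable", "25%") == "25%":
        findings.append("replicas <= 4 with default maxUnavailable: 25%")
    if not any((c.get("lifecycle") or {}).get("preStop") for c in containers):
        findings.append("no preStop hook")
    # A missing value and an explicit 30 both mean the default grace period.
    if pod_spec.get("terminationGracePeriodSeconds", 30) == 30:
        findings.append("terminationGracePeriodSeconds at default 30s")
    return findings


# Hypothetical sample in the shape of a Deployment list.
SAMPLE = json.loads("""
{"items": [{
  "metadata": {"namespace": "shop", "name": "api"},
  "spec": {
    "replicas": 2,
    "strategy": {"type": "RollingUpdate",
                 "rollingUpdate": {"maxUnavailable": "25%", "maxSurge": "25%"}},
    "template": {"spec": {
      "terminationGracePeriodSeconds": 30,
      "containers": [{"name": "api", "image": "example/api:1.0"}]}}
  }
}]}
""")

for item in SAMPLE["items"]:
    meta = item["metadata"]
    print(f"{meta['namespace']}/{meta['name']}: {review_deployment(item)}")
```

Argo Rollouts and Flagger resources are custom resources, so they would need a separate listing step that first checks whether the CRDs exist.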

Rating:

🟢 GREEN: Zero-downtime strategy (maxUnavailable: 0), graceful shutdown configured
🟡 AMBER: Rolling update but default settings, or no progressive delivery
🔴 RED: Deployments cause downtime, no graceful shutdown
⬜ UNKNOWN: Cannot determine rollback speed or process — suggest user investigate

8.2 — CI/CD Pipeline Integration

What to check:

ECR repositories: scan-on-push, tag immutability
Admission webhooks enforcing image policies
Image registries in use (ECR vs public)

How to check:

Describe ECR repositories → scanOnPush, imageTagMutability
List ValidatingWebhookConfigurations → filter for image/policy-related names
List running pods → aggregate image registries
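A rough sketch of the registry aggregation and ECR checks listed above. It assumes pod objects and ECR repository descriptions are already available as parsed JSON (the ECR fields `imageScanningConfiguration.scanOnPush` and `imageTagMutability` are those returned by `aws ecr describe-repositories`). The function names and sample values are hypothetical.

```python
from collections import Counter


def image_registry(image):
    """Return the registry host of an image reference.

    Follows the Docker convention: the first path component is a registry
    only if it contains '.' or ':' or is 'localhost'; otherwise the image
    comes from Docker Hub.
    """
    first, _, rest = image.partition("/")
    if rest and ("." in first or ":" in first or first == "localhost"):
        return first
    return "docker.io"


def aggregate_registries(pods):
    """Count container images per registry across a list of pod objects."""
    counts = Counter()
    for pod in pods:
        spec = pod.get("spec", {})
        for container in spec.get("containers", []) + spec.get("initContainers", []):
            counts[image_registry(container["image"])] += 1
    return counts


def review_ecr_repository(repo):
    """Return findings for one entry of an ECR repository description."""
    findings = []
    if not repo.get("imageScanningConfiguration", {}).get("scanOnPush", False):
        findings.append("scan-on-push disabled")
    if repo.get("imageTagMutability", "MUTABLE") == "MUTABLE":
        findings.append("image tags are mutable")
    return findings


# Hypothetical sample data.
pods = [
    {"spec": {"containers": [
        {"image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/api:v1"},
        {"image": "nginx:1.25"}]}},
    {"spec": {"containers": [{"image": "public.ecr.aws/nginx/nginx:latest"}]}},
]
print(dict(aggregate_registries(pods)))
print(review_ecr_repository({"repositoryName": "api", "imageTagMutability": "MUTABLE"}))
```

Which registries count as trusted is a policy decision for the reviewer; the sketch only produces the counts that the rating would draw on.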

Rating:

🟢 GREEN: Images scanned in CI, private registry, admission enforcement
🟡 AMBER: Pipeline exists but no scanning, or no admission enforcement
🔴 RED: No CI/CD evidence, images from untrusted public registries
⬜ UNKNOWN: Cannot determine full pipeline from cluster state — suggest user investigate

8.3 — Graceful Shutdown & Connection Draining

What to check:

Deployments with preStop hooks vs without
terminationGracePeriodSeconds (default 30s vs customized)
Services with AWS Load Balancer annotations (deregistration delay matters)
Ingress resources with target group attributes

How to check:

List Deployments → count those with lifecycle.preStop vs without
List Deployments → check terminationGracePeriodSeconds (null = default 30s)
List Services → check for service.beta.kubernetes.io/aws-load-balancer-type annotation
List Ingresses → check for alb.ingress.kubernetes.io/target-group-attributes annotation (deregistration delay should match or exceed terminationGracePeriodSeconds)
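The Ingress annotation check above might look like the following sketch. It assumes the annotation value is a comma-separated list of `key=value` pairs, with `deregistration_delay.timeout_seconds` as the relevant key, and that an unset delay falls back to the load balancer target group default of 300 seconds. The function names are hypothetical, and the comparison simply encodes the rule as stated in this check.

```python
ANNOTATION = "alb.ingress.kubernetes.io/target-group-attributes"
DEREGISTRATION_KEY = "deregistration_delay.timeout_seconds"


def deregistration_delay(ingress, default=300):
    """Return the deregistration delay (seconds) configured on an Ingress.

    Assumption: when the attribute is absent, the target group default
    of 300 seconds applies.
    """
    annotations = ingress.get("metadata", {}).get("annotations") or {}
    for pair in annotations.get(ANNOTATION, "").split(","):
        key, _, value = pair.strip().partition("=")
        if key == DEREGISTRATION_KEY and value.isdigit():
            return int(value)
    return default


def drain_aligned(delay_s, grace_period_s):
    """Rule as stated in this check: delay matches or exceeds the grace period."""
    return delay_s >= grace_period_s


# Hypothetical Ingress object in parsed-JSON form.
ingress = {"metadata": {"annotations": {
    ANNOTATION: "deregistration_delay.timeout_seconds=60,stickiness.enabled=true"}}}

delay = deregistration_delay(ingress)
print(f"deregistration delay: {delay}s, aligned with 30s grace: {drain_aligned(delay, 30)}")
```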

Rating:

🟢 GREEN: preStop hooks on all externally-facing deployments, grace period tuned, LB drain aligned
🟡 AMBER: Some deployments have preStop but not all
🔴 RED: No preStop hooks and experiencing 502s, or grace period too short
⬜ UNKNOWN: Cannot determine if 502s occur during deployments — suggest user investigate

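Key talking point: There's a race condition during pod termination. Endpoint removal and load balancer target deregistration happen asynchronously, so the LB can still send traffic for a few seconds after the container receives SIGTERM. A preStop sleep of 5-10s delays SIGTERM long enough for deregistration to propagate in most setups. terminationGracePeriodSeconds must cover the sleep plus the application's own shutdown time.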
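The timing involved can be worked through with a small budget calculation. This sketch assumes the standard Kubernetes sequence: the grace period starts when termination begins, the preStop hook runs first, SIGTERM follows once the hook completes, and SIGKILL is sent when the grace period expires. The function name and the example durations are hypothetical.

```python
def shutdown_budget(grace_period_s, pre_stop_sleep_s, app_shutdown_s):
    """Seconds left before SIGKILL after preStop and app shutdown finish.

    A negative result means the container would be killed mid-shutdown.
    """
    return grace_period_s - (pre_stop_sleep_s + app_shutdown_s)


# Example durations: default 30s grace period, 10s preStop sleep.
for app_shutdown in (5, 15, 25):
    slack = shutdown_budget(30, 10, app_shutdown)
    status = "ok" if slack >= 0 else "killed before shutdown completes"
    print(f"app shutdown {app_shutdown}s: slack {slack}s ({status})")
```

If the slack is negative, either the grace period needs raising or the application's shutdown needs shortening; lengthening the preStop sleep alone makes it worse.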
