# Multi-cluster pattern with observability, cost optimizations and metrics aggregation
## Objective
This pattern was created to solve a problem we face at AWS: we often receive third-party software for validation and need a consistent, automated approach to run Kubernetes evaluator testing, deploy containerized products, and validate them across a variety of Amazon EKS environments.
In this pattern we:

- Automate deployment of multiple EKS clusters in a region, with a Continuous Deployment pipeline triggered upon a commit to the GitHub repository that hosts the pipeline configuration.
- Configure the EKS clusters to deploy with different architectures (x86, ARM, or Bottlerocket) and different Kubernetes versions (the 3 most recent by default).
- Automate testing of all the available EKS Anywhere Addons on each of the clusters, essentially testing their compatibility across all the architecture/version combinations available today on AWS.
- Running this pattern 24x7, we observed high costs ($300 a day). By using AWS Systems Manager Automation and Auto Scaling groups we scale down to zero during non-business hours, resulting in 60% cost savings. We also borrowed optimized OTEL collector configurations from the CDK Observability Accelerator to further reduce Prometheus storage costs.
To learn more about our EKS Addon validation, check out our blog.
## GitOps configuration
GitOps is a branch of DevOps that focuses on using Git code repositories to manage infrastructure and application code deployments.
For this pattern there is a Git-driven deployment using GitHub and AWS CodePipeline, which automatically redeploys the EKS clusters when modifications are made to the GitHub repo.
Secondly, for the deployment of workloads on the clusters we leverage FluxCD. This is a GitOps approach for the workloads, i.e. the third-party software we want to validate on our hardware.
We require some additional secrets to be created in AWS Secrets Manager for the pattern to function properly:
- AWS CodePipeline Bootstrap - The AWS CodePipeline points to the GitHub fork of this repository, i.e. [cdk-eks-blueprints-patterns](https://github.com/aws-samples/cdk-eks-blueprints-patterns). A `github-token` secret must be stored as plaintext in AWS Secrets Manager for the CodePipeline to access the webhooks on GitHub (see the example after this list). For more information on how/why to set it up, please refer to the docs. The GitHub Personal Access Token should have these scopes:
    1. repo - to read your forked cdk-blueprint-patterns repository
    1. admin:repo_hook - if you plan to use webhooks (enabled by default)
- FluxCD Bootstrap - The FluxCD configuration points to the EKS Anywhere Addons repository. Since this is a public repository, you will not need to add a GitHub token to read it. As part of the FluxCD configuration, it uses Kustomize to apply all the addons that are in the repository, along with deploying their functional tests and a custom validator CronJob. Once a cluster is up, you can verify the reconciliation with the Flux CLI (see the check after this list).
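If you have not created the `github-token` secret yet, one minimal way to do it from the CLI is sketched below; the secret value is a placeholder you must replace with your own Personal Access Token.

```bash
# Create the plaintext github-token secret the pipeline reads
# (replace the placeholder value with your GitHub PAT).
aws secretsmanager create-secret \
  --name github-token \
  --description "GitHub PAT for CodePipeline webhook access" \
  --secret-string "ghp_xxxxxxxxxxxxxxxxxxxx"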
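A minimal check of the FluxCD reconciliation, assuming your kubeconfig points at one of the conformitron clusters and the `flux` CLI is installed:

```bash
# List the Git sources and Kustomizations FluxCD is tracking on the cluster.
flux get sources git -n flux-system
flux get kustomizations -n flux-system
```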
## Prerequisites
Start by setting the account and region environment variables:
```bash
export ACCOUNT_ID=$(aws sts get-caller-identity --query 'Account' --output text)
export AWS_REGION=$(aws configure get region)
```

Bootstrap the CDK environment:

```bash
cdk bootstrap
```
- Fork this repository (cdk-eks-blueprints-patterns) to your GitHub organisation/user.
- Git clone your forked repository onto your machine.
- Install the AWS CDK Toolkit globally on your machine using `npm install -g aws-cdk@2.133.0`.
- Increase AWS service quotas for the required resources; navigate to the Service Quota Tutorial to learn more.
We are using separate VPCs as a best practice, but you can use the default VPC if you prefer. Also, if you decide to use different regions for each cluster you don't need the quota increase; please reach out if you have a need for this use case. A CLI sketch for requesting the increases is shown below.

| SERVICE | QUOTA NAME | REQUESTED QUOTA |
|---|---|---|
| Amazon Virtual Private Cloud (Amazon VPC) | NAT gateways per Availability Zone | 30 |
| Amazon Virtual Private Cloud (Amazon VPC) | VPCs per region | 30 |
| Amazon Elastic Compute Cloud (Amazon EC2) | EC2-VPC Elastic IPs | 30 |
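The same increases can be requested from the CLI. The quota codes below are assumptions, so confirm them first with `aws service-quotas list-service-quotas --service-code vpc` (and `--service-code ec2`) before submitting the requests:

```bash
# NAT gateways per Availability Zone (VPC)
aws service-quotas request-service-quota-increase \
  --service-code vpc --quota-code L-FE5A380F --desired-value 30
# VPCs per Region (VPC)
aws service-quotas request-service-quota-increase \
  --service-code vpc --quota-code L-F678F1CE --desired-value 30
# EC2-VPC Elastic IPs (EC2)
aws service-quotas request-service-quota-increase \
  --service-code ec2 --quota-code L-0263D0A3 --desired-value 30
```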
- Amazon Managed Grafana Workspace: To visualize the metrics collected, you need an Amazon Managed Grafana workspace. If you have an existing workspace, create the environment variables `AMG_ENDPOINT_URL` and `AMG_WORKSPACE_ID` as described below. Otherwise, to create a new workspace, visit and run our supporting example for Grafana Deployment.

```bash
export AMG_ENDPOINT_URL=https://g-xxx.grafana-workspace.region.amazonaws.com
export AMG_WORKSPACE_ID=g-xxx
```
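If you already have a workspace and would rather not copy the values by hand, a sketch of deriving them from the CLI (this assumes you have exactly one workspace; adjust the query otherwise):

```bash
# Derive the workspace ID and endpoint URL of the first Grafana workspace returned.
export AMG_WORKSPACE_ID=$(aws grafana list-workspaces \
  --query 'workspaces[0].id' --output text)
export AMG_ENDPOINT_URL=https://$(aws grafana list-workspaces \
  --query 'workspaces[0].endpoint' --output text)
```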
- Grafana API Key: Amazon Managed Grafana provides a control plane API for generating Grafana API keys or Service Account Tokens. This allows programmatic provisioning of Grafana dashboards using the Grafana Operator on EKS.

```bash
export AMG_API_KEY=$(aws grafana create-workspace-api-key \
  --key-name "grafana-operator-key" \
  --key-role "ADMIN" \
  --seconds-to-live 432000 \
  --workspace-id $AMG_WORKSPACE_ID \
  --query key \
  --output text)
```
- AWS SSM Parameter Store for GRAFANA API KEY: Store the new Grafana API key in AWS SSM Parameter Store. This will be referenced by the Grafana Operator deployment of our solution to access and provision Grafana dashboards from the Amazon EKS monitoring cluster.

```bash
aws ssm put-parameter --name "/grafana-api-key" \
    --type "SecureString" \
    --value $AMG_API_KEY \
    --region $AWS_REGION
```
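To double-check that the value landed correctly, you can read the parameter back (decryption requires the appropriate IAM permissions):

```bash
aws ssm get-parameter --name "/grafana-api-key" \
  --with-decryption \
  --query 'Parameter.Value' --output text \
  --region $AWS_REGION
```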
- Amazon Managed Prometheus Workspace: To store observability metrics from all clusters we will use Amazon Managed Prometheus, due to its ease of setup and easy integration with other AWS services. We recommend setting up a new, separate Prometheus workspace using the CLI commands below. The provisioning of a new AMP workspace can be automated by leveraging the `.resourceProvider` in our CDK blueprints (see Example). We intentionally left this out to allow connecting with existing AMP deployments, but please reach out to us if you need guidance on automating this provisioning.

```bash
aws amp create-workspace --alias conformitron
```

Copy the `workspaceId` from the output and export it as a variable:

```bash
export AMP_WS_ID=ws-xxxxxxx-xxxx-xxxx-xxxx-xxxxxx
```
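Alternatively, a small sketch of pulling the workspace ID straight from the API instead of copying it by hand (assumes the `conformitron` alias is unique in your account):

```bash
# Look up the workspace created above by its alias and export its ID.
export AMP_WS_ID=$(aws amp list-workspaces --alias conformitron \
  --query 'workspaces[0].workspaceId' --output text)
echo $AMP_WS_ID
```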
- Modify the code in your forked repo to point to your GitHub username/organisation: open the pattern file source code and look for the declared const `gitOwner`. Change it to your GitHub username.
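If you are not sure where the constant lives, a quick way to locate it in your clone (the `lib/` path is simply where the pattern code typically sits; your fork's layout may differ):

```bash
# Find every declaration/reference of gitOwner in the pattern sources.
grep -rn "gitOwner" lib/
```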
## Deploying
Clone the repository:

```bash
git clone https://github.com/aws-samples/cdk-eks-blueprints-patterns.git
cd cdk-eks-blueprints-patterns
```
Set the pattern's parameters in the CDK context by overriding the cdk.json file (the environment variables you exported above are substituted in):
```bash
cat << EOF > cdk.json
{
"app": "npx ts-node dist/lib/common/default-main.js",
"context": {
"conformitron.amp.endpoint": "https://aps-workspaces.${AWS_REGION}.amazonaws.com/workspaces/${AMP_WS_ID}/",
"conformitron.amp.arn":"arn:aws:aps:${AWS_REGION}:${ACCOUNT_ID}:workspace/${AMP_WS_ID}",
"conformitron.amg.endpoint": "${AMG_ENDPOINT_URL}",
"conformitron.version": ["1.28","1.29","1.30"],
"fluxRepository": {
"name": "grafana-dashboards",
"namespace": "grafana-operator",
"repository": {
"repoUrl": "https://github.com/aws-observability/aws-observability-accelerator",
"name": "grafana-dashboards",
"targetRevision": "main",
"path": "./artifacts/grafana-operator-manifests/eks/infrastructure"
},
"values": {
"GRAFANA_CLUSTER_DASH_URL" : "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/cluster.json",
"GRAFANA_KUBELET_DASH_URL" : "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/kubelet.json",
"GRAFANA_NSWRKLDS_DASH_URL" : "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/namespace-workloads.json",
"GRAFANA_NODEEXP_DASH_URL" : "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/nodeexporter-nodes.json",
"GRAFANA_NODES_DASH_URL" : "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/nodes.json",
"GRAFANA_WORKLOADS_DASH_URL" : "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/workloads.json"
},
"kustomizations": [
{
"kustomizationPath": "./artifacts/grafana-operator-manifests/eks/infrastructure"
}
]
}
}
}
EOF
```
You are now ready to deploy the pipeline. Run the following command from the root of this repository to deploy the pipeline stack:
```bash
make pattern multi-cluster-conformitron deploy multi-cluster-central-pipeline
```

Now you can go to the AWS CodePipeline console and see how the pipeline was automatically created to deploy multiple Amazon EKS clusters to different environments.
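If you prefer the CLI over the console, you can list the pipelines that were created (the exact pipeline name depends on the stack naming of the pattern):

```bash
aws codepipeline list-pipelines \
  --query 'pipelines[].name' --output text
```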
## Grafana Dashboards
## SSM Cost Optimizations for conformitron clusters
Running all the clusters 24 hours a day results in a daily spend of $300+.

To minimize these costs we have written a Systems Manager automation which automatically scales the Auto Scaling groups down to 0 desired nodes during off-business hours:

- On weekdays at 5 PM PST clusters are scaled to 0 -> CRON EXPRESSION: `0 17 ? * MON-FRI *`
- On weekdays at 5 AM PST clusters are scaled to 1 -> CRON EXPRESSION: `0 05 ? * MON-FRI *`
- On weekends clusters stay scaled to 0.

These optimizations bring the weekly cost down to less than $1,000, for more than 60% cost savings.
Please find the SSM Automation documents at `lib/multi-cluster-construct/resources/cost-optimization/scaleDownEksToZero.yml` and `lib/multi-cluster-construct/resources/cost-optimization/scaleUpEksToOne.yml`.
Let's take a look at one of the scripts, scaleDownEksToZero.yml:

```yaml
schemaVersion: '0.3'
...
...
mainSteps:
  ...
  ...
    inputs:
      Service: eks
      Api: UpdateNodegroupConfig        # Update the managed node group
      clusterName: arm-1-26-blueprint   # Modify according to your naming convention
      nodegroupName: eks-blueprints-mng
      scalingConfig:
        minSize: 0                      # New scaling configuration
        maxSize: 1
        desiredSize: 0                  # Scale to zero
```
To run these scripts you will first have to update them with your own account ID. We will use the `sed` command to automatically update the files:

```bash
sed "s/ACCOUNT_ID/$ACCOUNT_ID/g" scaleDownEksToZero.yml > scaleDownEksToZeroNew.yml
sed "s/ACCOUNT_ID/$ACCOUNT_ID/g" scaleUpEksToOne.yml > scaleUpEksToOneNew.yml
```
- Then navigate to Systems Manager > Documents and create a new Automation.
- Click on JSON and copy over the YAML content to create a new runbook.
- Once saved, navigate to EventBridge > Scheduler > Schedules.
- Create a new schedule with the CRON expression specified above.
- For Target, select "StartAutomationExecution" and type in the document name from step 2.
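The same schedule can be sketched from the CLI using EventBridge Scheduler's universal target for SSM. The schedule name, role ARN, and document name below are placeholders you must supply, and the role needs `ssm:StartAutomationExecution` permission:

```bash
aws scheduler create-schedule \
  --name "scale-down-eks-clusters" \
  --schedule-expression "cron(0 17 ? * MON-FRI *)" \
  --schedule-expression-timezone "America/Los_Angeles" \
  --flexible-time-window '{"Mode": "OFF"}' \
  --target '{
    "Arn": "arn:aws:scheduler:::aws-sdk:ssm:startAutomationExecution",
    "RoleArn": "arn:aws:iam::ACCOUNT_ID:role/YOUR_SCHEDULER_ROLE",
    "Input": "{\"DocumentName\": \"ScaleDownEksToZero\"}"
  }'
```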