EKS cluster creation steps
The steps below create an EKS cluster called `trainium-inferentia`.
-
Before we begin, ensure that the following tools are installed on your machine: aws-cli, kubectl, and terraform. We use the DoEKS repository as a guide to deploy the cluster infrastructure in an AWS account.
-
Ensure that your account has enough Inf2 on-demand vCPUs, as most of the DoEKS blueprints use this instance type. To increase the service quota, navigate to the service quota page for the region you are in. Then select services in the left-side menu and search for Amazon Elastic Compute Cloud (Amazon EC2). This brings up the service quota page; search for `inf` there, and there should be an option for Running On-Demand Inf instances. Increase this quota to 300.
-
Clone the DoEKS repository.
-
Ensure that the region names are correct in the `variables.tf` file before running the cluster creation script.
-
Ensure that the ELB to be created is external facing. Change the helm value from `internal` to `internet-facing` here.
-
Ensure that the IAM role you are using has the permissions needed to create the cluster. We expect the following set of permissions to work, but the current recommendation is to also add the `AdministratorAccess` permission to the IAM role. At a later date you could remove `AdministratorAccess` and experiment with cluster creation without it.
  - Attach the following managed policies: `AmazonEKSClusterPolicy`, `AmazonEKS_CNI_Policy`, and `AmazonEKSWorkerNodePolicy`.
  - In addition to the managed policies, add the following as an inline policy. Replace your-account-id with the actual value of the AWS account ID you are using.

    ```json
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "VisualEditor0",
          "Effect": "Allow",
          "Action": ["ec2:CreateVpc", "ec2:DeleteVpc"],
          "Resource": [
            "arn:aws:ec2:*:your-account-id:ipv6pool-ec2/*",
            "arn:aws:ec2::your-account-id:ipam-pool/*",
            "arn:aws:ec2:*:your-account-id:vpc/*"
          ]
        },
        {
          "Sid": "VisualEditor1",
          "Effect": "Allow",
          "Action": ["ec2:ModifyVpcAttribute", "ec2:DescribeVpcAttribute"],
          "Resource": "arn:aws:ec2:*:your-account-id:vpc/*"
        },
        {
          "Sid": "VisualEditor2",
          "Effect": "Allow",
          "Action": "ec2:AssociateVpcCidrBlock",
          "Resource": [
            "arn:aws:ec2:*:your-account-id:ipv6pool-ec2/*",
            "arn:aws:ec2::your-account-id:ipam-pool/*",
            "arn:aws:ec2:*:your-account-id:vpc/*"
          ]
        },
        {
          "Sid": "VisualEditor3",
          "Effect": "Allow",
          "Action": [
            "ec2:DescribeSecurityGroupRules",
            "ec2:DescribeNatGateways",
            "ec2:DescribeAddressesAttribute"
          ],
          "Resource": "*"
        },
        {
          "Sid": "VisualEditor4",
          "Effect": "Allow",
          "Action": [
            "ec2:CreateInternetGateway",
            "ec2:RevokeSecurityGroupEgress",
            "ec2:CreateRouteTable",
            "ec2:CreateSubnet"
          ],
          "Resource": [
            "arn:aws:ec2:*:your-account-id:security-group/*",
            "arn:aws:ec2:*:your-account-id:internet-gateway/*",
            "arn:aws:ec2:*:your-account-id:subnet/*",
            "arn:aws:ec2:*:your-account-id:route-table/*",
            "arn:aws:ec2::your-account-id:ipam-pool/*",
            "arn:aws:ec2:*:your-account-id:vpc/*"
          ]
        },
        {
          "Sid": "VisualEditor5",
          "Effect": "Allow",
          "Action": ["ec2:AttachInternetGateway", "ec2:AssociateRouteTable"],
          "Resource": [
            "arn:aws:ec2:*:your-account-id:vpn-gateway/*",
            "arn:aws:ec2:*:your-account-id:internet-gateway/*",
            "arn:aws:ec2:*:your-account-id:subnet/*",
            "arn:aws:ec2:*:your-account-id:route-table/*",
            "arn:aws:ec2:*:your-account-id:vpc/*"
          ]
        },
        {
          "Sid": "VisualEditor6",
          "Effect": "Allow",
          "Action": "ec2:AllocateAddress",
          "Resource": [
            "arn:aws:ec2:*:your-account-id:ipv4pool-ec2/*",
            "arn:aws:ec2:*:your-account-id:elastic-ip/*"
          ]
        },
        {
          "Sid": "VisualEditor7",
          "Effect": "Allow",
          "Action": "ec2:ReleaseAddress",
          "Resource": "arn:aws:ec2:*:your-account-id:elastic-ip/*"
        },
        {
          "Sid": "VisualEditor8",
          "Effect": "Allow",
          "Action": "ec2:CreateNatGateway",
          "Resource": [
            "arn:aws:ec2:*:your-account-id:subnet/*",
            "arn:aws:ec2:*:your-account-id:natgateway/*",
            "arn:aws:ec2:*:your-account-id:elastic-ip/*"
          ]
        }
      ]
    }
    ```

  1. Add the Role ARN and name here in the `variables.tf` file by updating these lines. Move the structure inside the `default` list and replace the role ARN and name values with the values for the role you are using.
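Replacing the account ID placeholder by hand across this many ARNs is easy to get wrong. As a minimal sketch, assuming the inline policy above has been saved as a text template, the substitution and a JSON validity check could look like the following. The helper name `render_policy` and the shortened `EXAMPLE_TEMPLATE` are hypothetical and are not part of the DoEKS repository.

```python
import json

# Hypothetical helper (not part of the DoEKS repository): fills the
# account ID placeholder into a policy template and parses it as JSON.
PLACEHOLDER = "your-account-id"


def render_policy(template: str, account_id: str) -> dict:
    """Return the policy as a dict with the placeholder replaced."""
    # AWS account IDs are 12-digit numbers.
    if not (account_id.isascii() and account_id.isdigit() and len(account_id) == 12):
        raise ValueError("expected a 12-digit AWS account ID")
    # Also accept the <your-account-id> spelling of the placeholder.
    filled = template.replace(f"<{PLACEHOLDER}>", account_id)
    filled = filled.replace(PLACEHOLDER, account_id)
    # json.loads raises json.JSONDecodeError if the result is malformed.
    return json.loads(filled)


# Shortened excerpt of the inline policy above, for illustration only.
EXAMPLE_TEMPLATE = """
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor7",
      "Effect": "Allow",
      "Action": "ec2:ReleaseAddress",
      "Resource": "arn:aws:ec2:*:your-account-id:elastic-ip/*"
    }
  ]
}
"""

if __name__ == "__main__":
    # 123456789012 is a dummy account ID used only for this example.
    policy = render_policy(EXAMPLE_TEMPLATE, "123456789012")
    print(json.dumps(policy, indent=2))
```

The printed JSON can then be pasted into the inline policy editor for the role. Parsing the result before use catches a stray bracket or quote early, before the IAM console rejects the document.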
-
Navigate into the `ai-ml/trainium-inferentia/` directory and run the `install.sh` script.
Note: This step takes about 12-15 minutes to deploy the EKS infrastructure and cluster in the AWS account. To view more details on cluster creation, view an example here: Deploy Llama3 on EKS in the prerequisites section.
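Because the install script runs for several minutes, it may help to confirm beforehand that the tools named in the prerequisites are available. The following is a minimal pre-flight sketch, not something shipped with the DoEKS repository. The function name `missing_tools` is hypothetical, and the executable names are an assumption (aws-cli is normally invoked as `aws`).

```python
import shutil

# Executable names are assumptions: aws-cli is normally invoked as `aws`.
REQUIRED_TOOLS = ("aws", "kubectl", "terraform")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that are not found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools found.")
```

This only checks that each executable can be located. It does not verify versions or that AWS credentials are configured.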
-
After the cluster is created, navigate to the Karpenter EC2 node IAM role called `karpenter-trainium-inferentia-XXXXXXXXXXXXXXXXXXXXXXXXX`. Attach the following inline policy to the role: