EKS cluster creation steps

The steps below create an EKS cluster called trainium-inferentia.

  1. Make sure the prerequisites are in place before you begin: aws-cli, kubectl, and terraform must be installed on your machine. We use the DoEKS repository as a guide to deploy the cluster infrastructure in an AWS account.
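
    The tool check in this step can be scripted. The following is a minimal sketch, assuming a POSIX shell; the function name missing_tools is ours and is not part of DoEKS. Note that the aws-cli binary is called aws.

    ```shell
    # Print the name of every required tool that is not found on PATH.
    missing_tools() {
        for tool in "$@"; do
            command -v "$tool" >/dev/null 2>&1 || echo "$tool"
        done
    }

    # Tools named in this step; prints nothing when all three are installed.
    missing_tools aws kubectl terraform
    ```

    Any name printed by the last line still needs to be installed before continuing.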

  2. Ensure that your account has enough Inf2 on-demand vCPUs, since most of the DoEKS blueprints use this instance type. To increase the service quota, open the Service Quotas page for the region you are in. Select services in the left-side menu and search for Amazon Elastic Compute Cloud (Amazon EC2). On the resulting quota page, search for inf; there should be an option for Running On-Demand Inf instances. Increase this quota to 300.
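
    The same quota can be inspected from the command line. This is a sketch, assuming the AWS CLI is configured with credentials for the account; the function name inf_quota and the region us-west-2 are illustrative assumptions, not values from DoEKS.

    ```shell
    # Print the quota code and current value of the Inf on-demand quota in a region.
    inf_quota() {
        aws service-quotas list-service-quotas \
            --service-code ec2 \
            --region "$1" \
            --query "Quotas[?QuotaName=='Running On-Demand Inf instances'].[QuotaCode,Value]" \
            --output text
    }

    # Usage (region is an assumption, use your own):
    #   inf_quota us-west-2
    # To request the increase, pass the quota code printed by inf_quota:
    #   aws service-quotas request-service-quota-increase --service-code ec2 \
    #       --quota-code <quota-code> --desired-value 300 --region us-west-2
    ```

    Looking the quota up by name avoids hard-coding a quota code in the script.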

  3. Clone the DoEKS repository

    git clone https://github.com/awslabs/data-on-eks.git
    
  4. Ensure that the region names are correct in the variables.tf file before running the cluster creation script.
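
    A quick way to review the region settings is to list every line of the file that mentions a region. This is only a sketch; the function name show_region_settings is hypothetical, and we have not assumed anything about how variables.tf is laid out beyond it containing the word region.

    ```shell
    # Show every line in a Terraform file that mentions "region", with line numbers.
    show_region_settings() {
        grep -n 'region' "$1"
    }

    # Usage, from the directory where the repository was cloned:
    #   show_region_settings data-on-eks/ai-ml/trainium-inferentia/variables.tf
    ```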

  5. Ensure that the ELB to be created is external-facing: change the helm value from internal to internet-facing here.

  6. Ensure that the IAM role you are using has the permissions needed to create the cluster. We expect the following set of permissions to work, but the current recommendation is to also add the AdministratorAccess permission to the IAM role. At a later date you could remove AdministratorAccess and experiment with cluster creation without it.

    1. Attach the following managed policies: AmazonEKSClusterPolicy, AmazonEKS_CNI_Policy, and AmazonEKSWorkerNodePolicy.
    2. In addition to the managed policies, add the following as an inline policy. Replace your-account-id with the actual AWS account ID you are using.

      {
      "Version": "2012-10-17",
      "Statement": [
          {
              "Sid": "VisualEditor0",
              "Effect": "Allow",
              "Action": [
                  "ec2:CreateVpc",
                  "ec2:DeleteVpc"
              ],
              "Resource": [
                  "arn:aws:ec2:*:your-account-id:ipv6pool-ec2/*",
                  "arn:aws:ec2::your-account-id:ipam-pool/*",
                  "arn:aws:ec2:*:your-account-id:vpc/*"
              ]
          },
          {
              "Sid": "VisualEditor1",
              "Effect": "Allow",
              "Action": [
                  "ec2:ModifyVpcAttribute",
                  "ec2:DescribeVpcAttribute"
              ],
              "Resource": "arn:aws:ec2:*:your-account-id:vpc/*"
          },
          {
              "Sid": "VisualEditor2",
              "Effect": "Allow",
              "Action": "ec2:AssociateVpcCidrBlock",
              "Resource": [
                  "arn:aws:ec2:*:your-account-id:ipv6pool-ec2/*",
                  "arn:aws:ec2::your-account-id:ipam-pool/*",
                  "arn:aws:ec2:*:your-account-id:vpc/*"
              ]
          },
          {
              "Sid": "VisualEditor3",
              "Effect": "Allow",
              "Action": [
                  "ec2:DescribeSecurityGroupRules",
                  "ec2:DescribeNatGateways",
                  "ec2:DescribeAddressesAttribute"
              ],
              "Resource": "*"
          },
          {
              "Sid": "VisualEditor4",
              "Effect": "Allow",
              "Action": [
                  "ec2:CreateInternetGateway",
                  "ec2:RevokeSecurityGroupEgress",
                  "ec2:CreateRouteTable",
                  "ec2:CreateSubnet"
              ],
              "Resource": [
                  "arn:aws:ec2:*:your-account-id:security-group/*",
                  "arn:aws:ec2:*:your-account-id:internet-gateway/*",
                  "arn:aws:ec2:*:your-account-id:subnet/*",
                  "arn:aws:ec2:*:your-account-id:route-table/*",
                  "arn:aws:ec2::your-account-id:ipam-pool/*",
                  "arn:aws:ec2:*:your-account-id:vpc/*"
              ]
          },
          {
              "Sid": "VisualEditor5",
              "Effect": "Allow",
              "Action": [
                  "ec2:AttachInternetGateway",
                  "ec2:AssociateRouteTable"
              ],
              "Resource": [
                  "arn:aws:ec2:*:your-account-id:vpn-gateway/*",
                  "arn:aws:ec2:*:your-account-id:internet-gateway/*",
                  "arn:aws:ec2:*:your-account-id:subnet/*",
                  "arn:aws:ec2:*:your-account-id:route-table/*",
                  "arn:aws:ec2:*:your-account-id:vpc/*"
              ]
          },
          {
              "Sid": "VisualEditor6",
              "Effect": "Allow",
              "Action": "ec2:AllocateAddress",
              "Resource": [
                  "arn:aws:ec2:*:your-account-id:ipv4pool-ec2/*",
                  "arn:aws:ec2:*:your-account-id:elastic-ip/*"
              ]
          },
          {
              "Sid": "VisualEditor7",
              "Effect": "Allow",
              "Action": "ec2:ReleaseAddress",
              "Resource": "arn:aws:ec2:*:your-account-id:elastic-ip/*"
          },
          {
              "Sid": "VisualEditor8",
              "Effect": "Allow",
              "Action": "ec2:CreateNatGateway",
              "Resource": [
                  "arn:aws:ec2:*:your-account-id:subnet/*",
                  "arn:aws:ec2:*:your-account-id:natgateway/*",
                  "arn:aws:ec2:*:your-account-id:elastic-ip/*"
              ]
          }
      ]
      }
      
    3. Add the role ARN and name here in the variables.tf file by updating these lines. Move the structure inside the default list and replace the role ARN and name values with the values for the role you are using.
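
    Attaching the managed policies and the inline policy can be scripted with the AWS CLI. This is a sketch, assuming the inline policy above is saved to a local file; the file names inline-policy.json and rendered-policy.json, the policy name doeks-vpc-inline, and both function names are hypothetical.

    ```shell
    # Replace the account ID placeholder (with or without angle brackets)
    # in a policy template with a real account ID.
    render_policy() {
        sed -e "s/<your-account-id>/$1/g" -e "s/your-account-id/$1/g" "$2"
    }

    # Attach the three managed policies and the rendered inline policy to a role.
    attach_policies() {
        role="$1"
        for policy in AmazonEKSClusterPolicy AmazonEKS_CNI_Policy AmazonEKSWorkerNodePolicy; do
            aws iam attach-role-policy --role-name "$role" \
                --policy-arn "arn:aws:iam::aws:policy/$policy"
        done
        aws iam put-role-policy --role-name "$role" \
            --policy-name doeks-vpc-inline \
            --policy-document "file://$2"
    }

    # Usage (the role name is a placeholder):
    #   account_id="$(aws sts get-caller-identity --query Account --output text)"
    #   render_policy "$account_id" inline-policy.json > rendered-policy.json
    #   attach_policies <your-role-name> rendered-policy.json
    ```

    Rendering the policy into a separate file keeps the template reusable across accounts.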

  7. Navigate into the ai-ml/trainium-inferentia/ directory and run the install.sh script.

    cd data-on-eks/ai-ml/trainium-inferentia/
    ./install.sh
    

    Note: Deploying the EKS infrastructure and cluster in the AWS account takes about 12-15 minutes. For more details on cluster creation, see the prerequisites section of Deploy Llama3 on EKS.
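
    Once install.sh finishes, one way to confirm that the cluster is reachable is to update the local kubeconfig and list the nodes. This is a sketch; the function name verify_cluster is ours, and the region in the usage line is an assumption.

    ```shell
    # Point kubectl at the new cluster, then list its nodes.
    verify_cluster() {
        aws eks update-kubeconfig --region "$1" --name trainium-inferentia &&
            kubectl get nodes
    }

    # Usage (region is an assumption, use the one set in variables.tf):
    #   verify_cluster us-west-2
    ```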

  8. After the cluster is created, navigate to the Karpenter EC2 node IAM role called karpenter-trainium-inferentia-XXXXXXXXXXXXXXXXXXXXXXXXX. Attach the following inline policy to the role:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "Statement1",
                "Effect": "Allow",
                "Action": [
                    "iam:CreateServiceLinkedRole"
                ],
                "Resource": "*"
            }
        ]
    }
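
    The policy above can also be attached from the command line. This is a sketch, assuming the JSON is saved to a local file; the file name karpenter-inline-policy.json, the policy name karpenter-create-slr, and both function names are hypothetical.

    ```shell
    # Print the names of IAM roles that start with the Karpenter node role prefix.
    find_karpenter_role() {
        aws iam list-roles \
            --query "Roles[?starts_with(RoleName, 'karpenter-trainium-inferentia')].RoleName" \
            --output text
    }

    # Attach the inline policy stored in a local JSON file to the given role.
    attach_karpenter_policy() {
        aws iam put-role-policy --role-name "$1" \
            --policy-name karpenter-create-slr \
            --policy-document "file://$2"
    }

    # Usage:
    #   role="$(find_karpenter_role)"
    #   attach_karpenter_policy "$role" karpenter-inline-policy.json
    ```

    If find_karpenter_role prints more than one name, pick the role that belongs to this cluster before attaching the policy.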