Benchmark foundation models for AWS Chips

You can use FMBench to benchmark foundation models on AWS Chips (Trainium 1 and Inferentia 2), hosted on Amazon SageMaker, Amazon EKS, or Amazon EC2. FMs must first be compiled for Neuron before they can be deployed on AWS Chips. SageMaker JumpStart makes this easier by providing many FMs as JumpStart Models that can be deployed on SageMaker directly; alternatively, you can compile models for Neuron yourself or have FMBench do the compilation for you. All of these options are described below.

Benchmarking for AWS Chips on SageMaker

  1. Several FMs are available through SageMaker JumpStart, already compiled for Neuron and ready to deploy. See this link for more details.

  2. You can compile the model outside of FMBench using the instructions available here and in the Neuron documentation, deploy it on SageMaker, and then use FMBench in the bring-your-own-endpoint mode; see this config file for an example.

  3. You can have FMBench compile and deploy the model on SageMaker for you. See this Llama3-8b config file or this Llama3.1-70b config file for examples, and search this website for "inf2" or "trn1" to find others. In this case FMBench downloads the model from Hugging Face (you need to provide your Hugging Face token in the /tmp/fmbench-read/scripts/hf_token.txt file; the file contains just the token without any formatting), compiles it for Neuron, uploads the compiled model to S3 (you specify the bucket in the config file), and then deploys the model to a SageMaker endpoint, as sketched after this list.
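The following is a minimal sketch of option 3. It assumes FMBench is already installed; the token value and config file path are placeholders (substitute an actual inf2/trn1 config from this repository), and the log redirection simply follows the usual FMBench invocation pattern.

```bash
# 1. Store your Hugging Face token; the file contains only the token itself,
#    with no quotes or other formatting. (Token value is a placeholder.)
echo "hf_xxxxxxxxxxxxxxxxxxxx" > /tmp/fmbench-read/scripts/hf_token.txt

# 2. Run FMBench with an inf2/trn1 config (path below is a placeholder).
#    FMBench downloads the model from Hugging Face, compiles it for Neuron,
#    uploads the compiled artifacts to the S3 bucket named in the config,
#    and deploys the model to a SageMaker endpoint.
fmbench --config-file /tmp/fmbench-read/configs/llama3/8b/config-llama3-8b-inf2.yml > fmbench.log 2>&1
```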

Benchmarking for AWS Chips on EC2

You may want to benchmark models hosted directly on EC2. In this case both FMBench and the model run on the same EC2 instance, and FMBench deploys the model for you. See this Llama3.1-70b config file or this Llama3-8b config file for examples. FMBench downloads the model from Hugging Face (you need to provide your Hugging Face token in the /tmp/fmbench-read/scripts/hf_token.txt file; the file contains just the token without any formatting), pulls the inference container from the ECR repo, and then runs the container with the downloaded model; the resulting local endpoint is what FMBench uses to run inference, as sketched below.
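A minimal sketch of an EC2 run, assuming FMBench is installed on the same inf2/trn1 instance. The config file path and token value are placeholders; the --local-mode, --write-bucket, and --tmp-dir flags follow FMBench's documented EC2 usage, but verify them against your installed version.

```bash
# Store your Hugging Face token (placeholder value; the file contains only
# the token itself, with no other formatting).
echo "hf_xxxxxxxxxxxxxxxxxxxx" > /tmp/fmbench-read/scripts/hf_token.txt

# --local-mode yes keeps both deployment and benchmarking on this instance:
# FMBench pulls the inference container from ECR, runs it with the model
# downloaded from Hugging Face, and benchmarks against the local endpoint
# the container exposes. (Config path below is a placeholder.)
fmbench --config-file /tmp/fmbench-read/configs/llama3.1/70b/config-ec2-llama3-1-70b.yml \
        --local-mode yes --write-bucket placeholder --tmp-dir /tmp > fmbench.log 2>&1
```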