# Benchmark foundation models for AWS Chips
You can use FMBench to benchmark foundation models on AWS Chips: Trainium 1 and Inferentia 2. This can be done on Amazon SageMaker, Amazon EKS, or Amazon EC2. FMs must first be compiled for Neuron before they can be deployed on AWS Chips. SageMaker JumpStart makes this easier by providing most FMs as JumpStart models that can be deployed on SageMaker directly; you can also compile models for Neuron yourself, or have FMBench do it for you. All of these options are described below.
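FMBench is distributed as a Python package, so as a minimal setup sketch (assuming a recent Python environment is already in place; see the FMBench documentation for the full setup steps), installation looks like this:

```bash
# Install the latest FMBench release from PyPI
pip install -U fmbench
```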
## Benchmarking for AWS Chips on SageMaker
- Several FMs are available through SageMaker JumpStart, already compiled for Neuron and ready to deploy. See this link for more details.
- You can compile the model outside of FMBench using the instructions available here and in the Neuron documentation, deploy it on SageMaker, and use FMBench in the bring your own endpoint mode; see this config file for an example.
- You can have FMBench compile and deploy the model on SageMaker for you. See this Llama3-8b config file or this Llama3.1-70b config file for examples; search this website for "inf2" or "trn1" to find others. In this case FMBench will download the model from Hugging Face (you need to provide your Hugging Face token in the /tmp/fmbench-read/scripts/hf_token.txt file; the file contains just the token, without any formatting, as shown in the sketch after this list), compile it for Neuron, upload the compiled model to Amazon S3 (you specify the bucket in the config file), and then deploy the model to a SageMaker endpoint.
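Putting these steps together, a run might look like the following minimal sketch. The token value and the config file path are placeholders; substitute your own Hugging Face token and one of the inf2/trn1 config files referenced above.

```bash
# Write the Hugging Face token to the location FMBench reads it from;
# the file must contain only the raw token, no quotes or other formatting.
echo hf_xxxxxxxxxxxxxxxx > /tmp/fmbench-read/scripts/hf_token.txt

# Launch the benchmark (placeholder config path): FMBench downloads the
# model, compiles it for Neuron, uploads the compiled artifacts to the
# S3 bucket named in the config, deploys a SageMaker endpoint, and then
# runs inference against it.
fmbench --config-file <path-to-inf2-or-trn1-config.yml> > fmbench.log 2>&1
```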
## Benchmarking for AWS Chips on EC2
You may want to benchmark models hosted directly on EC2. In this case both FMBench and the model run on the same EC2 instance, and FMBench deploys the model for you. See this Llama3.1-70b config file or this Llama3-8b config file for examples. FMBench will download the model from Hugging Face (you need to provide your Hugging Face token in the /tmp/fmbench-read/scripts/hf_token.txt file; the file contains just the token, without any formatting), pull the inference container from the ECR repository, and then run the container with the downloaded model. This provides a local endpoint that FMBench then uses to run inference; a sketch of such a run is shown below.
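For illustration, an EC2 run in local mode might look like the following sketch. The --local-mode, --write-bucket, and --tmp-dir flags reflect my understanding of FMBench's EC2 workflow, and the config path is a placeholder; check fmbench --help and the FMBench documentation for the authoritative options.

```bash
# Assumes the Hugging Face token file described above is already in
# place at /tmp/fmbench-read/scripts/hf_token.txt.

# Run FMBench on the EC2 instance itself: it pulls the inference
# container from ECR, starts it with the downloaded model, and then
# benchmarks against the resulting local endpoint.
fmbench --config-file <path-to-ec2-config.yml> \
        --local-mode yes \
        --write-bucket placeholder \
        --tmp-dir /tmp > fmbench.log 2>&1
```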