Invoke Model API Example

Overview

This notebook demonstrates how to get started with Amazon Bedrock. We will show you how to query different models through the Bedrock API and how prompt engineering can help improve the results for your use case.

The code presented here has been adapted from the Amazon Bedrock Workshop content.

Context

Amazon Bedrock simplifies the process of building and scaling generative AI applications by providing access to high-performing foundation models (FMs) from leading AI companies through a single API.

Amazon Bedrock

Amazon Bedrock supports foundation models (FMs) from the following providers. For the updated list of FMs and their documentation, see Supported foundation models in Amazon Bedrock.

To use a foundation model with the Amazon Bedrock API, you'll need its model ID. For a list of model IDs, see Amazon Bedrock model IDs.

Amazon Bedrock - Choice of FMs

Prerequisites

Before you can use Amazon Bedrock, you must carry out the following steps:

  • Sign up for an AWS account (if you don't already have one) and create an IAM role with the necessary permissions for Amazon Bedrock; see AWS Account and IAM Role.
  • Request access to the foundation models (FMs) that you want to use; see Request access to FMs.

We used the following foundation models for the examples in this notebook, in the us-east-1 (N. Virginia) Region.

Provider Name | Foundation Model Name | Model ID
AI21 Labs     | Jamba-Instruct        | ai21.jamba-instruct-v1:0
Amazon        | Nova Lite             | amazon.nova-lite-v1:0
Anthropic     | Claude 3.5 Sonnet     | anthropic.claude-3-5-sonnet-20240620-v1:0
Cohere        | Command               | cohere.command-text-v14
Meta          | Llama 3 8B Instruct   | meta.llama3-8b-instruct-v1:0
Mistral AI    | Mixtral 8X7B Instruct | mistral.mixtral-8x7b-instruct-v0:1
Stability AI  | Stable Diffusion XL   | stability.stable-diffusion-xl-v1
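
The model IDs in the table appear to follow a "provider.model-name[:version]" pattern. The sketch below splits an ID on that assumption; split_model_id is a hypothetical helper written for this notebook, and the pattern is an observation from the IDs listed here rather than a documented guarantee.

```python
def split_model_id(model_id: str) -> dict:
    """Split a model ID, assuming the form 'provider.model-name[:version]'."""
    # Everything before the first dot is treated as the provider prefix.
    provider, _, rest = model_id.partition(".")
    # An optional ':suffix' is treated as a version marker.
    name, _, version = rest.partition(":")
    return {"provider": provider, "name": name, "version": version or None}


MODEL_IDS = [
    "ai21.jamba-instruct-v1:0",
    "amazon.nova-lite-v1:0",
    "cohere.command-text-v14",
    "mistral.mixtral-8x7b-instruct-v0:1",
]

for model_id in MODEL_IDS:
    print(split_model_id(model_id))
```

The provider prefix is handy later in the notebook, because the request and response formats differ per provider.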

Setup

Info

This notebook should work well with the Data Science 3.0 kernel (Python 3.10 runtime) in SageMaker Studio.

Run the cells in this section to install the packages needed by this notebook.

%pip install --no-build-isolation --force-reinstall \
    "boto3>=1.28.57" \
    "awscli>=1.29.57" \
    "botocore>=1.31.57"

Notebook/Code with comments

Create the boto3 client

Interaction with the Bedrock API is done via the AWS SDK. We will be using AWS SDK for Python: boto3 for this notebook.

You can refer to the Amazon Bedrock API references for each SDK.

Use different clients

In boto3, control plane APIs such as ListFoundationModels are supported by the Amazon Bedrock client, and data plane APIs such as InvokeModel and InvokeModelWithResponseStream are supported by the Amazon Bedrock Runtime client.

The get_bedrock_client() function accepts a runtime parameter (default True) and returns the bedrock-runtime client when it is True, or the bedrock client otherwise.

Use the default credential chain

If you are running this notebook from Amazon SageMaker Studio and your SageMaker Studio execution role has permissions to access Bedrock, then you can just run the cells below as-is. This is also the case if you are running these notebooks from a computer whose default AWS credentials have access to Bedrock.

Use a different AWS Region

If you're running this notebook from your own computer or a SageMaker notebook in a different AWS Region from where Bedrock is set up, you can un-comment the os.environ['AWS_DEFAULT_REGION'] line below and specify the region to use.

Use a specific profile

If you're running this notebook from your own computer where you have set up the AWS CLI with multiple profiles, and the profile that has access to Bedrock is not the default one, you can un-comment the os.environ['AWS_PROFILE'] line below and specify the profile to use.

Use a different role

If you or your company has set up a specific, separate IAM role to access Bedrock, you can specify it by un-commenting the os.environ['BEDROCK_ASSUME_ROLE'] line below. Ensure that your current user or role has permission to assume that role.

"""Helper utilities for working with Amazon Bedrock from Python notebooks"""
# Python Built-Ins:
import os
from typing import Optional
import sys
import json

# External Dependencies:
import boto3
from botocore.config import Config
import botocore

def get_bedrock_client(
    assumed_role: Optional[str] = None,
    region: Optional[str] = None,
    runtime: Optional[bool] = True,
):
    """Create a boto3 client for Amazon Bedrock, with optional configuration overrides

    Parameters
    ----------
    assumed_role :
        Optional ARN of an AWS IAM role to assume for calling the Bedrock service. If not
        specified, the current active credentials will be used.
    region :
        Optional name of the AWS Region in which the service should be called (e.g. "us-east-2").
        If not specified, AWS_REGION or AWS_DEFAULT_REGION environment variable will be used.
    runtime :
        Optional choice of getting different client to perform operations with the Amazon Bedrock service.
    """
    if region is None:
        target_region = os.environ.get("AWS_REGION", os.environ.get("AWS_DEFAULT_REGION"))
    else:
        target_region = region

    print(f"Create new client\n  Using region: {target_region}")
    session_kwargs = {"region_name": target_region}
    client_kwargs = {**session_kwargs}

    profile_name = os.environ.get("AWS_PROFILE")
    if profile_name:
        print(f"  Using profile: {profile_name}")
        session_kwargs["profile_name"] = profile_name

    retry_config = Config(
        region_name=target_region,
        retries={
            "max_attempts": 10,
            "mode": "standard",
        },
    )
    session = boto3.Session(**session_kwargs)

    if assumed_role:
        print(f"  Using role: {assumed_role}", end='')
        sts = session.client("sts")
        response = sts.assume_role(
            RoleArn=str(assumed_role),
            RoleSessionName="langchain-llm-1"
        )
        print(" ... successful!")
        client_kwargs["aws_access_key_id"] = response["Credentials"]["AccessKeyId"]
        client_kwargs["aws_secret_access_key"] = response["Credentials"]["SecretAccessKey"]
        client_kwargs["aws_session_token"] = response["Credentials"]["SessionToken"]

    if runtime:
        service_name='bedrock-runtime'
    else:
        service_name='bedrock'

    bedrock_client = session.client(
        service_name=service_name,
        config=retry_config,
        **client_kwargs
    )

    print("boto3 Bedrock client successfully created!")
    print(bedrock_client._endpoint)
    return bedrock_client


module_path = ".."
sys.path.append(os.path.abspath(module_path))


# ---- ⚠️ Un-comment and edit the below lines as needed for your AWS setup ⚠️ ----

os.environ["AWS_DEFAULT_REGION"] = "us-east-1"
# os.environ["AWS_PROFILE"] = "<YOUR_PROFILE>"
# os.environ["BEDROCK_ASSUME_ROLE"] = "<YOUR_ROLE_ARN>"  # E.g. "arn:aws:..."


boto3_bedrock = get_bedrock_client(
    assumed_role=os.environ.get("BEDROCK_ASSUME_ROLE", None),
    region=os.environ.get("AWS_DEFAULT_REGION", None),
    runtime=False
)

Validate the connection

We can check that the client works by trying out the list_foundation_models() method, which lists all the models available for us to use.

boto3_bedrock.list_foundation_models()
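
The raw response is long, so it can help to reduce it to just the model IDs. The sketch below groups IDs by provider, assuming the response contains a "modelSummaries" list whose entries carry "providerName" and "modelId" keys. model_ids_by_provider is a hypothetical helper, and sample_response is a hand-written stand-in so the cell runs without calling AWS.

```python
from collections import defaultdict


def model_ids_by_provider(response: dict) -> dict:
    """Group model IDs from a ListFoundationModels-style response by provider."""
    grouped = defaultdict(list)
    for summary in response.get("modelSummaries", []):
        provider = summary.get("providerName", "unknown")
        grouped[provider].append(summary["modelId"])
    # Sort for stable, readable output.
    return {provider: sorted(ids) for provider, ids in grouped.items()}


# Hand-written stand-in for a real response (illustrative values only).
sample_response = {
    "modelSummaries": [
        {"providerName": "Amazon", "modelId": "amazon.nova-lite-v1:0"},
        {"providerName": "Anthropic", "modelId": "anthropic.claude-3-5-sonnet-20240620-v1:0"},
        {"providerName": "Meta", "modelId": "meta.llama3-8b-instruct-v1:0"},
    ]
}

for provider, ids in model_ids_by_provider(sample_response).items():
    print(provider, ids)
```

With a live client, you would pass boto3_bedrock.list_foundation_models() in place of sample_response.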

`InvokeModel` body and output

The invoke_model() method of the Amazon Bedrock runtime client (InvokeModel API) will be the primary method we use for most of our Text Generation and Processing tasks - whichever model we're using.

Although the method is shared, the format of the input and output varies with the foundation model used; see Inference parameters for foundation models.

Common inference parameter definitions

Randomness and Diversity

Foundation models support the following parameters to control randomness and diversity in the response.

Temperature – Large language models use probability to construct the words in a sequence. For any given next word, there is a probability distribution of options for the next word in the sequence. When you set the temperature closer to zero, the model tends to select the higher-probability words. When you set the temperature further away from zero, the model may select a lower-probability word.

In technical terms, the temperature modulates the probability density function for the next tokens, implementing the temperature sampling technique. This parameter can deepen or flatten the density function curve. A lower value results in a steeper curve with more deterministic responses, and a higher value results in a flatter curve with more random responses.
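
To make the effect concrete, here is a minimal sketch of temperature scaling applied to a softmax over made-up scores for three candidate words. The scores and the softmax_with_temperature helper are illustrative assumptions; Bedrock models apply temperature internally and do not expose this computation.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [score / temperature for score in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(score - peak) for score in scaled]
    total = sum(exps)
    return [value / total for value in exps]


words = ["horses", "zebras", "unicorns"]
logits = [4.0, 2.0, 0.5]  # made-up scores for illustration

cold = softmax_with_temperature(logits, 0.2)  # steeper curve
hot = softmax_with_temperature(logits, 2.0)   # flatter curve

for word, p_cold, p_hot in zip(words, cold, hot):
    print(f"{word:10s} T=0.2: {p_cold:.4f}   T=2.0: {p_hot:.4f}")
```

At the lower temperature nearly all of the probability mass sits on the top-scoring word, while the higher temperature spreads it across the alternatives.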

Top K – Temperature shapes the probability distribution of potential words, and Top K defines the cutoff beyond which the model no longer considers words. For example, if K=50, the model selects from the 50 most probable words that could come next in a given sequence. This reduces the probability that an unusual word gets selected next. In technical terms, Top K is the number of highest-probability vocabulary tokens to keep for top-K filtering. This limits the distribution of probable tokens, so the model chooses one of the highest-probability tokens.
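
A minimal sketch of top-K filtering over a toy distribution follows. The probabilities and the top_k_filter helper are assumptions made for illustration and do not reflect any particular model's implementation.

```python
def top_k_filter(probs: dict, k: int) -> dict:
    """Keep the k most probable tokens and renormalize their probabilities."""
    if k < 1:
        raise ValueError("k must be at least 1")
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


# Toy next-word distribution (illustrative values only).
probs = {"horses": 0.7, "zebras": 0.2, "unicorns": 0.1}

filtered = top_k_filter(probs, k=2)
print(filtered)  # "unicorns" is dropped; the remaining two are rescaled
```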

Top P – Top P defines a cutoff based on the sum of probabilities of the potential choices. If you set Top P below 1.0, the model considers the most probable options and ignores less probable ones. Top P is similar to Top K, but instead of capping the number of choices, it caps choices based on the sum of their probabilities.

For the example prompt "I hear the hoof beats of," you may want the model to provide "horses," "zebras," or "unicorns" as the next word. If you set the temperature to its maximum, without capping Top K or Top P, you increase the probability of getting unusual results such as "unicorns." If you set the temperature to 0, you increase the probability of "horses." If you set a high temperature but cap Top K or Top P at a low value, you increase the probability of "horses" or "zebras," and decrease the probability of "unicorns."
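
The following sketch shows one common way to implement a Top P (nucleus) cutoff: keep the most probable tokens until their cumulative probability reaches P, then renormalize. The top_p_filter helper and the toy probabilities are illustrative assumptions, not a description of how any Bedrock model is implemented.

```python
def top_p_filter(probs: dict, p: float) -> dict:
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


# Toy next-word distribution (illustrative values only).
probs = {"horses": 0.7, "zebras": 0.2, "unicorns": 0.1}

print(top_p_filter(probs, p=0.5))  # only "horses" is needed to reach 0.5
print(top_p_filter(probs, p=0.8))  # "horses" and "zebras" together reach 0.8
```

Unlike Top K, the number of tokens kept here depends on how the probability mass is distributed, not on a fixed count.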

Length

The following parameters control the length of the generated response.

Response length – Configures the minimum and maximum number of tokens to use in the generated response.

Length penalty – Length penalty optimizes the model to be more concise in its output by penalizing longer responses. Length penalty differs from response length as the response length is a hard cut off for the minimum or maximum response length.

In technical terms, the length penalty penalizes the model exponentially for lengthy responses. 0.0 means no penalty. Set a value less than 0.0 for the model to generate longer sequences, or set a value greater than 0.0 for the model to produce shorter sequences.

Repetitions

The following parameters help control repetition in the generated response.

Repetition penalty (presence penalty) – Prevents repetitions of the same words (tokens) in responses. 1.0 means no penalty. Greater than 1.0 decreases repetition.
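
One common formulation of a repetition penalty, used by several open-source text generation libraries, rescales the scores of tokens that have already been generated. The sketch below follows that formulation as an assumption; the exact mechanism used by each Bedrock model is not specified here, and apply_repetition_penalty is a hypothetical helper.

```python
def apply_repetition_penalty(logits: dict, generated: list, penalty: float) -> dict:
    """Lower the scores of already-generated tokens; 1.0 leaves scores unchanged."""
    if penalty <= 0:
        raise ValueError("penalty must be positive")
    adjusted = dict(logits)
    for token in set(generated):
        if token in adjusted:
            score = adjusted[token]
            # Dividing a positive score or multiplying a negative one both
            # make the token less likely when penalty > 1.0.
            adjusted[token] = score / penalty if score > 0 else score * penalty
    return adjusted


# Made-up scores for illustration.
logits = {"the": 2.0, "horse": 1.0, "ran": -1.0}
generated = ["the", "ran"]

penalized = apply_repetition_penalty(logits, generated, penalty=2.0)
print(penalized)
```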

Try out the models

With some theory out of the way, let's see the models in action! Run the cells below to see basic, synchronous example invocations for each model:

bedrock_runtime = get_bedrock_client(
    assumed_role=os.environ.get("BEDROCK_ASSUME_ROLE", None),
    region=os.environ.get("AWS_DEFAULT_REGION", None)
)

def invoke_model(body, model_id, accept, content_type):
    """
    Invokes an Amazon Bedrock model to run inference
    using the input provided in the request body.

    Args:
        body (dict): The invocation body to send to Bedrock
        model_id (str): ID of the model to query
        accept (str): MIME type expected for the response body
        content_type (str): MIME type of the request body
    Returns:
        Inference response from the model.
    """

    try:
        response = bedrock_runtime.invoke_model(
            body=json.dumps(body), 
            modelId=model_id, 
            accept=accept, 
            contentType=content_type
        )

        return response

    except Exception:
        print(f"Couldn't invoke {model_id}")
        raise  # re-raise with the original traceback intact

AI21 Labs Jamba

# If you'd like to try your own prompt, edit this parameter!
prompt_data = """Command: Write me a blog about making strong business decisions as a leader.

Blog:
"""

body={"messages": [
            {
                "role": "user", 
                "content": prompt_data 
            }
         ],
        "max_tokens": 200,
    }

modelId = "ai21.jamba-instruct-v1:0"  # change this to use a different version from the model provider
accept = "application/json"
contentType = "application/json"

response = invoke_model(body, modelId, accept, contentType)
response_body = json.loads(response.get("body").read())

print(response_body.get("choices")[0].get("message").get("content"))

Amazon Nova

# If you'd like to try your own prompt, edit this parameter!
prompt_data = """Command: Write me a blog about making strong business decisions as a leader.

Blog:
"""

# Define one or more messages using the "user" and "assistant" roles.
message_list = [{"role": "user", "content": [{"text": prompt_data}]}]

# Configure the inference parameters.
inf_params = {"max_new_tokens": 250, "top_p": 0.9, "top_k": 20, "temperature": 0.7}

body = {
    "schemaVersion": "messages-v1",
    "messages": message_list,
    "inferenceConfig": inf_params,
}

modelId = "amazon.nova-lite-v1:0"
accept = "application/json"
contentType = "application/json"

response = invoke_model(body, modelId, accept, contentType)
response_body = json.loads(response.get("body").read())

print(response_body.get("output").get("message").get("content")[0].get("text"))

Anthropic Claude

# If you'd like to try your own prompt, edit this parameter!
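# The Messages API carries the speaker role in each message, so the prompt
# does not need "Human:" / "Assistant:" markers.
prompt_data = """Write me a blog about making strong business decisions as a leader."""

messages = [{"role": "user", "content": prompt_data}]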

body={
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 250,
        "messages": messages
    }

modelId = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # change this to use a different version from the model provider
accept = "application/json"
contentType = "application/json"

response = invoke_model(body, modelId, accept, contentType)
response_body = json.loads(response.get("body").read())

print(response_body.get("content")[0].get("text"))

Cohere

# If you'd like to try your own prompt, edit this parameter!
prompt_data = """Command: Write me a blog about making strong business decisions as a leader.

Blog:
"""

body = {
    "prompt": prompt_data,
    "max_tokens": 200,
}

modelId = "cohere.command-text-v14" 
accept = "application/json"
contentType = "application/json"

response = invoke_model(body, modelId, accept, contentType)

response_body = json.loads(response.get('body').read())

print(response_body.get("generations")[0].get("text"))

Meta Llama

# If you'd like to try your own prompt, edit this parameter!
# Llama 3 Instruct uses header tokens rather than the Llama 2 [INST] / <<SYS>> markers.
prompt_data = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.<|eot_id|><|start_header_id|>user<|end_header_id|>

Write me a blog about making strong business decisions as a leader.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""

body = {
    "prompt": prompt_data,
    "temperature": 0.5,
    "top_p": 0.9,
    "max_gen_len": 512,
}

modelId = "meta.llama3-8b-instruct-v1:0"
accept = "application/json"
contentType = "application/json"

response = invoke_model(body, modelId, accept, contentType)
response_body = json.loads(response.get("body").read())

print(response_body["generation"])

Mistral Models

# If you'd like to try your own prompt, edit this parameter!
prompt_data = """<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

Write me a blog about making strong business decisions as a leader. [/INST]"""

body = {
    "prompt": prompt_data,
    'max_tokens': 500,
    'top_p': 0.9,
    'temperature': 0.2
}

modelId = 'mistral.mixtral-8x7b-instruct-v0:1'
accept = 'application/json'
contentType = 'application/json'

response = invoke_model(body, modelId, accept, contentType)

response_body = json.loads(response.get("body").read())
print(response_body.get('outputs')[0].get('text'))
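
Each text model above returns its generated text under a different key path. As a sketch, those paths can be collected into one helper keyed on the provider prefix of the model ID. extract_text is a hypothetical helper, and the paths are taken only from the cells in this notebook, so other models from the same providers may use different response shapes.

```python
def extract_text(model_id: str, response_body: dict) -> str:
    """Return the generated text for the text models used in this notebook."""
    provider = model_id.split(".", 1)[0]
    if provider == "ai21":
        return response_body["choices"][0]["message"]["content"]
    if provider == "amazon":
        return response_body["output"]["message"]["content"][0]["text"]
    if provider == "anthropic":
        return response_body["content"][0]["text"]
    if provider == "cohere":
        return response_body["generations"][0]["text"]
    if provider == "meta":
        return response_body["generation"]
    if provider == "mistral":
        return response_body["outputs"][0]["text"]
    raise ValueError(f"No extractor defined for provider '{provider}'")


# Hand-written stand-ins for parsed response bodies (illustrative only).
print(extract_text("meta.llama3-8b-instruct-v1:0", {"generation": "Hello from Llama"}))
print(extract_text("cohere.command-text-v14", {"generations": [{"text": "Hello from Command"}]}))
```

With a live call, response_body would be the result of json.loads(response.get("body").read()), as in the cells above.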

Stability Stable Diffusion XL

# If you'd like to try your own prompt, edit this parameter!
prompt_data = "a landscape with trees"

body = {
    "text_prompts": [{"text": prompt_data}],
    "cfg_scale": 10,
    "seed": 20,
    "steps": 50
}
modelId = "stability.stable-diffusion-xl-v1"
accept = "application/json"
contentType = "application/json"


response = invoke_model(body, modelId, accept, contentType)
response_body = json.loads(response.get("body").read())

print(response_body["result"])
print(f'{response_body.get("artifacts")[0].get("base64")[0:80]}...')

Note

The output is a base64 encoded string of the image data. You can use any image processing library (such as Pillow) to decode the image as in the example below:

import base64
import io
from PIL import Image

base_64_img_str = response_body.get("artifacts")[0].get("base64")
image = Image.open(io.BytesIO(base64.decodebytes(bytes(base_64_img_str, "utf-8"))))
image
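
If you only want to save the image to disk, Pillow is not strictly needed. The sketch below decodes the base64 string with the standard library and checks for the PNG file signature before writing. It assumes the model returns PNG data; decode_artifact and looks_like_png are hypothetical helpers, and sample_b64 is a stand-in so the cell runs without a model response.

```python
import base64

# The 8-byte signature that begins every PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_artifact(base64_str: str) -> bytes:
    """Decode a base64-encoded artifact into raw bytes."""
    return base64.b64decode(base64_str)


def looks_like_png(data: bytes) -> bool:
    """Check whether the bytes start with the PNG file signature."""
    return data.startswith(PNG_SIGNATURE)


# Stand-in for response_body["artifacts"][0]["base64"] (not a real image).
sample_b64 = base64.b64encode(PNG_SIGNATURE + b"demo").decode("ascii")

image_bytes = decode_artifact(sample_b64)
print(looks_like_png(image_bytes))

# To save a real result, write the decoded bytes to a file:
# with open("landscape.png", "wb") as f:
#     f.write(image_bytes)
```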

Generate streaming output

For large language models, it can take a noticeable amount of time to generate long output sequences. Rather than waiting for the entire response to be available, latency-sensitive applications may prefer to stream the response to users.

Run the code below to see how you can achieve this with Bedrock's invoke_model_with_response_stream() method - returning the response body in separate chunks.

from IPython.display import clear_output, display, display_markdown, Markdown
from datetime import datetime

# If you'd like to try your own prompt, edit this parameter!
prompt_data = """Command: Write me a blog about making strong business decisions as a leader.

Blog:
"""

# Define one or more messages using the "user" and "assistant" roles.
message_list = [{"role": "user", "content": [{"text": prompt_data}]}]

# Configure the inference parameters.
inf_params = {"max_new_tokens": 250, "top_p": 0.9, "top_k": 20, "temperature": 0.7}

body = json.dumps({
    "schemaVersion": "messages-v1",
    "messages": message_list,
    "inferenceConfig": inf_params,
})

modelId = "amazon.nova-lite-v1:0"  # (Change this, and the request body, to try different models)
accept = "application/json"
contentType = "application/json"

start_time = datetime.now()

response = bedrock_runtime.invoke_model_with_response_stream(
    body=body, modelId=modelId, accept=accept, contentType=contentType
)
chunk_count = 0
time_to_first_token = None

# Process the response stream
stream = response.get("body")
if stream:
    for event in stream:
        chunk = event.get("chunk")
        if chunk:
            # Print the response chunk
            chunk_json = json.loads(chunk.get("bytes").decode())
            # Pretty print JSON
            # print(json.dumps(chunk_json, indent=2, ensure_ascii=False))
            content_block_delta = chunk_json.get("contentBlockDelta")
            if content_block_delta:
                if time_to_first_token is None:
                    time_to_first_token = datetime.now() - start_time
                    print(f"Time to first token: {time_to_first_token}")

                chunk_count += 1
                current_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S:%f")
                # print(f"{current_time} - ", end="")
                print(content_block_delta.get("delta").get("text"), end="")
    print(f"\nTotal chunks: {chunk_count}")  # leading newline: streamed text ends without one
else:
    print("No response stream received.")
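
If you also need the complete text once streaming finishes, the chunks can be accumulated as they arrive. The sketch below uses the same event shape as the loop above ("chunk", then "bytes", then a "contentBlockDelta" payload). collect_stream_text and fake_event are hypothetical helpers, and fake_stream is a hand-built stand-in so the cell runs without calling Bedrock.

```python
import json


def collect_stream_text(events) -> str:
    """Join the text deltas from a stream of chunk events into one string."""
    parts = []
    for event in events:
        chunk = event.get("chunk")
        if not chunk:
            continue
        payload = json.loads(chunk["bytes"].decode("utf-8"))
        # Events without a contentBlockDelta carry no generated text.
        delta = payload.get("contentBlockDelta", {}).get("delta", {})
        text = delta.get("text")
        if text:
            parts.append(text)
    return "".join(parts)


def fake_event(payload: dict) -> dict:
    """Build a stand-in event with the same nesting as the real stream."""
    return {"chunk": {"bytes": json.dumps(payload).encode("utf-8")}}


# Illustrative payloads only; real streams contain more fields and event types.
fake_stream = [
    fake_event({"messageStart": {"role": "assistant"}}),
    fake_event({"contentBlockDelta": {"delta": {"text": "Hello"}}}),
    fake_event({"contentBlockDelta": {"delta": {"text": ", world"}}}),
    fake_event({"messageStop": {"stopReason": "end_turn"}}),
]

full_text = collect_stream_text(fake_stream)
print(full_text)
```

With a live call, you would pass response.get("body") from invoke_model_with_response_stream() in place of fake_stream.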

Prompt Engineering

Prompt engineering is the practice of optimizing the quality and performance of your foundation model's response to your request. Prompt engineering may involve:

  • Word choice
  • Phrasing
  • Providing examples (few-shot learning)
  • Use of line breaks and content separators
  • Following established formats that align with how the model was trained
  • Use of stop sequences to help the model know when it should stop generating text

Communicating clearly

The art of prompt engineering is the art of communication. Large language models have been trained on a massive amount of written and transcribed human content. So just like when communicating with people, it's critical to communicate clearly with the models. Throughout these labs, you will see examples of varying levels of detail and clarity.

prompt_data = """Write an email from Bob, Customer Service Manager, 
to the customer "John Doe" that provided negative feedback on the service 
provided by our customer support engineer.
"""
messages = [{"role": "user", "content": prompt_data}]

body={
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 500,
        "messages": messages
    }

modelId = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # change this to use a different version from the model provider
accept = "application/json"
contentType = "application/json"

response = invoke_model(body, modelId, accept, contentType)
response_body = json.loads(response.get("body").read())

print(response_body.get("content")[0].get("text"))

prompt_data = """Write an email from Bob, Customer Service Manager, 
to the customer "John Doe" that provided negative feedback on the service 
provided by our customer support engineer. Here is the feedback provided.
<customer_feedback>
Hello Bob,
     I am very disappointed with the recent experience I had when I called your customer support and spoke with Anna Bhasin.
     I was expecting an immediate call back but it took three days for us to get a call back.
     The first suggestion to fix the problem was incorrect. Ultimately the problem was fixed after three days.
     We are very unhappy with the response provided and may consider taking our business elsewhere.
</customer_feedback>
"""
messages = [{"role": "user", "content": prompt_data}]

body={
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 500,
        "messages": messages
    }

modelId = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # change this to use a different version from the model provider
accept = "application/json"
contentType = "application/json"

response = invoke_model(body, modelId, accept, contentType)
response_body = json.loads(response.get("body").read())

print(response_body.get("content")[0].get("text"))
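
The two prompts above differ only in whether the customer's feedback is included. A small template function makes that difference explicit and keeps the XML-style separator consistent. This is a sketch; build_email_prompt and build_claude_body are hypothetical helpers, and the request body mirrors the one used in the cells above.

```python
from typing import Optional


def build_email_prompt(sender: str, role: str, customer: str,
                       feedback: Optional[str] = None) -> str:
    """Build the email-writing prompt, optionally embedding the customer's feedback."""
    prompt = (
        f'Write an email from {sender}, {role}, to the customer "{customer}" '
        "that provided negative feedback on the service provided by our "
        "customer support engineer."
    )
    if feedback:
        # Tags separate the quoted feedback from the instructions.
        prompt += (
            " Here is the feedback provided.\n"
            f"<customer_feedback>\n{feedback.strip()}\n</customer_feedback>"
        )
    return prompt


def build_claude_body(prompt: str, max_tokens: int = 500) -> dict:
    """Wrap a prompt in the request body format used for Claude in this notebook."""
    return {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }


vague = build_email_prompt("Bob", "Customer Service Manager", "John Doe")
detailed = build_email_prompt(
    "Bob", "Customer Service Manager", "John Doe",
    feedback="It took three days for us to get a call back.",
)
print(detailed)
```

Either prompt can then be sent with invoke_model(build_claude_body(detailed), modelId, accept, contentType).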

Next steps

Now that we have seen how to use Amazon Bedrock APIs, you can learn

Clean up

This notebook does not require any cleanup or additional deletion of resources.