Chat with Your Document

Chat with your document using Knowledge Bases for Amazon Bedrock - RetrieveAndGenerate API

With chat with your document capability, you can securely ask questions on single documents, without the overhead of setting up a vector database or ingesting data, making it effortless for businesses to use their enterprise data. You only need to provide a relevant data file as input and choose your FM to get started.

For details around use cases and benefits, please refer to this blogpost.

Pre-requisites

Python 3.10

⚠ For this lab we need to run the notebook based on a Python 3.10 runtime. ⚠

Setup

Install following packages.

%pip install --upgrade pip
%pip install --upgrade boto3
%pip install --upgrade botocore
%pip install pypdf

<h2>restart kernel</h2>
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

Before we begin, lets check the boto3 version, make sure its equal to or greater than 1.34.94

import boto3
boto3.__version__

Initialize client for Amazon Bedrock for accessing the RetrieveAndGenerate API.

import boto3
import pprint
from botocore.client import Config

pp = pprint.PrettyPrinter(indent=2)
session = boto3.session.Session()
region = session.region_name
bedrock_config = Config(connect_timeout=120, read_timeout=120, retries={'max_attempts': 0})

bedrock_agent_client = boto3.client("bedrock-agent-runtime",
                              region_name=region,
                              config=bedrock_config,
                                    )
model_id = "anthropic.claude-3-sonnet-20240229-v1:0"

For data, you can either upload the document you want to chat with or point to the Amazon Simple Storage Service (Amazon S3) bucket location that contains your file. We provide you with both options in the notebook. However in both cases, the supported file formats are PDF, MD (Markdown), TXT, DOCX, HTML, CSV, XLS, and XLSX. Make that the file size does not exceed 10 MB and contains no more than 20K tokens. A token is considered to be a unit of text, such as a word, sub-word, number, or symbol, that is processed as a single entity. Due to the preset ingestion token limit, it is recommended to use a file under 10MB. However, a text-heavy file, that is much smaller than 10MB, can potentially breach the token limit.

Option 1 - Upload the document

In our example, we will use a pdf file.

<h2>load pdf</h2>
from pypdf import PdfReader
<h2>creating a pdf reader object</h2>
file_name = "<path of your file such as file.pdf>" #path of the file on your local machine.
reader = PdfReader(file_name)
<h2>printing number of pages in pdf file</h2>
print(len(reader.pages))
text = ""
page_count = 1
for page in reader.pages:
    text+= f"\npage_{str(page_count)}\n {page.extract_text()}"
print(text)

Option 2 - Point to S3 location of your file

Make sure to replace the bucket_name and prefix_file_name to the location of your file.

bucket_name = "<replace with your bucket name>"
prefix_file_name = "<replace with the file name in your bucket>" #include prefixes if any alongwith the file name.
document_s3_uri = f's3://{bucket_name}/{prefix_file_name}'

RetreiveAndGenerate API for chatting with your document

The code in the below cell, defines a Python function called retrieveAndGenerate that takes two optional arguments: input (the input text) and sourceType (the type of source to use, defaulting to "S3"). It also sets a default value for the model_id parameter.

The function constructs an Amazon Resource Name (ARN) for the specified model using the model_id and the REGION variable.

If the sourceType is "S3", the function calls the retrieve_and_generate method of the bedrock_agent_client object, passing in the input text and a configuration for retrieving and generating from external sources. The configuration specifies that the source is an S3 location, and it provides the S3 URI of the document.

If the sourceType is not "S3", the function calls the same retrieve_and_generate method, but with a different configuration. In this case, the source is specified as byte content, which includes a file name, content type (application/pdf), and the actual text data.

def retrieveAndGenerate(input, sourceType="S3", model_id = "anthropic.claude-3-sonnet-20240229-v1:0"):
    model_arn = f'arn:aws:bedrock:{region}::foundation-model/{model_id}'
    if sourceType=="S3":
        return bedrock_agent_client.retrieve_and_generate(
            input={
                'text': input
            },
            retrieveAndGenerateConfiguration={
                'type': 'EXTERNAL_SOURCES',
                'externalSourcesConfiguration': {
                    'modelArn': model_arn,
                    "sources": [
                        {
                            "sourceType": sourceType,
                            "s3Location": {
                                "uri": document_s3_uri
                            }
                        }
                    ]
                }
            }
        )
    else:
        return bedrock_agent_client.retrieve_and_generate(
            input={
                'text': input
            },
            retrieveAndGenerateConfiguration={
                'type': 'EXTERNAL_SOURCES',
                'externalSourcesConfiguration': {
                    'modelArn': model_arn,
                    "sources": [
                        {
                            "sourceType": sourceType,
                            "byteContent": {
                                "identifier": file_name,
                                "contentType": "application/pdf",
                                "data": text,
                                }
                        }
                    ]
                }
            }
        )

If you want to chat with the document by uploading the file use sourceType as BYTE_CONTENT for pointing it to s3 bucket, use sourceType as S3.

query = "Summarize the document"
response = retrieveAndGenerate(input=query, sourceType="BYTE_CONTENT")
generated_text = response['output']['text']
pp.pprint(generated_text)

Citations or source attributions

Lets retrieve the source attribution or citations for the above response.

citations = response["citations"]
contexts = []
for citation in citations:
    retrievedReferences = citation["retrievedReferences"]
    for reference in retrievedReferences:
         contexts.append(reference["content"]["text"])

pp.pprint(contexts)

Next Steps

In this notebook, we covered how Knowledge Bases for Amazon Bedrock now simplifies asking questions on a single document. We also demonstrated how to configure and use this capability through the Amazon Bedrock - AWS SDK, showcasing the simplicity and flexibility of this feature, which provides a zero-setup solution to gather information from a single document, without setting up a vector database.

To further explore the capabilities of Knowledge Bases for Amazon Bedrock, refer to the following resources:

Knowledge bases for Amazon Bedrock