Vector Storage
Alternative Reference
If you are looking to benchmark multiple LLMs and RAG engines in a simple way, you should check out aws-samples/aws-genai-llm-chatbot. That project focuses more on experimentation with models and vector stores, while this project focuses more on building an extendable 3-tier application.
Currently, Galileo offers a single implementation for the storage of RAG vector embeddings: Aurora PostgreSQL Serverless with pgvector.
Postgres Table Naming
To support multiple embedding models and vector sizes, the current implementation derives the database table name from the normalized model id and the vector size. If you change the embedding model or the vector size, a new database table is created. Currently there is no support for choosing which database table to use at runtime; you must deploy the updates and re-index the data into the new table. We are working on a more scalable solution for this.
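As a rough illustration only (the actual naming logic lives in the Galileo vector store code and may differ), a per-model table name could be derived along these lines:

// Illustrative sketch, not the actual Galileo implementation:
// each (embedding model, vector size) pair maps to its own pgvector table.
function embeddingTableName(modelId: string, vectorSize: number): string {
  // Normalize the model id into a safe Postgres identifier.
  const normalized = modelId
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "_")
    .replace(/^_+|_+$/g, "");
  return `${normalized}_${vectorSize}`;
}

// "intfloat/multilingual-e5-base" with 768 dimensions => "intfloat_multilingual_e5_base_768"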
Getting data into the vector store
There is an indexing pipeline included in the Corpus stack, which is an AWS Step Functions state machine capable of processing a large number of files in parallel (40k+ files). The pipeline supports incremental and bulk updates, and is configured to index all files in the "processed bucket" included in the corpus stack.
The "processed bucket" is the destination for objects that have already been processed, and expected to contain only raw text files with metadata defined on the objects. For data transformation, it is expected to have custom ETL processes for the data which will end up in this bucket.
Example Only
This is very much an example architecture for data ingestion at scale, and it is expected to be replaced or modified to support your specific use case. Rather than attempting to force your implementation into the current process, it is recommended to replace it based on your needs. As we learn more about the patterns customers are using, we will work on extending this to support more real-world use cases.
Manual Process
Currently the state machine must be manually triggered. It also supports scheduling, which is disabled by default and can be configured in the corpus stack (demo/infra/src/application/corpus/index.ts) properties.
Data import flow
Using the CLI's Document Uploader, the document import flow is as follows:
sequenceDiagram
actor User
participant CLI
participant S3 as S3<br/>bucket
participant SFN as Indexing<br/>Workflow
participant VectorStore as Vector<br/>Store
participant SMJob as SageMaker<br/>Processing Job
participant SMEmb as SageMaker Endpoint<br/>Embedding model
autonumber
User ->> CLI: document upload
CLI ->> CLI: prepare data
CLI ->> S3: upload documents<br/>(with metadata)
CLI -->> User: trigger workflow?
User ->> CLI: yes
CLI -->> SFN: start execution
SFN ->> SFN: config workflow
SFN ->> S3: query changeset
S3 -->> SFN: objects
SFN ->> VectorStore: initialize<br/>vector store
SFN ->> SMJob: start processing job
activate SMJob
loop For each document
SMJob ->> SMEmb: toVector(document)
SMEmb -->> SMJob: vector
SMJob ->> VectorStore: upsert(vector)
end
SMJob -->> SFN: resume<br/>(processing job finished)
deactivate SMJob
alt Indexing on?
SFN ->> VectorStore: create/update<br/>index
end
Sample Dataset
The current sample dataset (US Supreme Court Cases) is defined as a stack which uses the CDK S3 Deployment construct to batch deploy data into the "processed bucket" with the respective metadata. Additionally, the sample dataset stack will automatically trigger the state machine for indexing.
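A simplified sketch of that pattern (the asset path and metadata below are placeholders, not the actual sample dataset stack):

import { Construct } from "constructs";
import { Bucket } from "aws-cdk-lib/aws-s3";
import { BucketDeployment, Source } from "aws-cdk-lib/aws-s3-deployment";

// Deploy a local folder of pre-processed text files into the processed bucket,
// attaching user-defined object metadata to every deployed object.
export function deploySampleData(scope: Construct, processedBucket: Bucket): void {
  new BucketDeployment(scope, "SampleDataset", {
    sources: [Source.asset("./sample-dataset/processed")],
    destinationBucket: processedBucket,
    destinationKeyPrefix: "sample-dataset/",
    metadata: {
      source: "sample-dataset",
    },
  });
}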
Being Deprecated
We are working on completely refactoring the way we handle sample data, and on enabling easy testing of other local data via the CLI. Expect this to change very soon.
FAQs
How to load sample data for a demo?
If you deploy via the CLI, you can choose to load the supplied sample dataset. To load your own data, run pnpm galileo-cli document upload --help to use the provided helper.
You can also manually add files (with metadata) to the provided S3 bucket, and then run the corpus state machine to perform indexing.
Embeddings
This project supports multiple embedding models through the CLI or the demo/infra/config.json file. Each embedding model is uniquely identified by a model reference key (a.k.a. modelRefKey), which is a human-readable phrase composed of ASCII characters and digits.
In the following example, two embedding models are configured in the demo/infra/config.json file. One model utilizes the intfloat/multilingual-e5-base model, while the other employs the sentence-transformers/all-mpnet-base-v2 model.
<trimmed>
  "rag": {
    "managedEmbeddings": {
      "instanceType": "ml.g4dn.xlarge",
      "embeddingsModels": [
        {
          "uuid": "multilingual-e5-base",
          "modelId": "intfloat/multilingual-e5-base",
          "dimensions": 768,
          "modelRefKey": "English",
          "default": true
        },
        {
          "uuid": "all-mpnet-base-v2",
          "modelId": "sentence-transformers/all-mpnet-base-v2",
          "dimensions": 768,
          "modelRefKey": "Vietnamese"
        }
      ],
      "autoscaling": {
        "maxCapacity": 5
      }
    }
  },
<trimmed>
Embedding is handled by a SageMaker Endpoint that supports multiple models through a custom inference script. Through configuration, you can deploy multiple models for direct testing, and the application will use the specified model at runtime when modelRefKey is provided in the embeddings request. If modelRefKey is not provided, the default (or first) model will be used to serve the embeddings request.
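For illustration, a caller could invoke the endpoint roughly as follows; the endpoint name and the request/response body shapes are assumptions here, so refer to the custom inference script listed below for the actual contract:

import { SageMakerRuntimeClient, InvokeEndpointCommand } from "@aws-sdk/client-sagemaker-runtime";

const client = new SageMakerRuntimeClient({});

// Hypothetical request body: the real schema is defined by the custom inference script.
const response = await client.send(
  new InvokeEndpointCommand({
    EndpointName: "galileo-managed-embeddings", // placeholder name
    ContentType: "application/json",
    Body: JSON.stringify({
      texts: ["What cases cite Marbury v. Madison?"],
      modelRefKey: "English", // omit to use the default model
    }),
  }),
);

// Decode the (assumed) JSON response containing the vectors.
const result = JSON.parse(new TextDecoder().decode(response.Body));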
The custom inference script supports all AutoModel classes from the Hugging Face transformers package. Relevant source locations:
packages/galileo-cdk/src/ai/llms/models/managed-embeddings
packages/galileo-cdk/src/ai/llms/models/managed-embeddings/custom.asset/code/inference.py
demo/infra/src/application/corpus
How to change embedding model for Chat?
In the Chat settings
panel, navigate to the Semantic Search
tab. From the Embedding Model
dropdown list, select the desired embedding model.
How to change embedding model for indexing pipeline?
When submitting documents to indexing pipeline, you need to provide modelRefKey
which is used to indicate which embedding model to use.
How to change document chunking?
Chunk size and overlap are not configurable via the CLI or config at this time; you will need to edit the source code.
demo/corpus/logic/src/env.ts contains the default env vars
demo/infra/src/application/corpus/index.ts contains env var overrides for the state machine
CHUNK_SIZE: "1000",
CHUNK_OVERLAP: "200",
How to filter data and improve search results?
Currently the backend supports filtering; however, the UI does not have controls for filtering yet. You will need to customize the UI and the chat message API calls to support filtering at this time.
Pipeline StateMachine
See demo/infra/src/application/corpus/pipeline/types.ts
for all configuration options.
How to start indexing objects in S3?
Start an execution of the Pipeline StateMachine from the Step Functions console with the default payload. It will auto-detect new/modified files and index them, or if everything is new it will bulk index everything. If you change the embedding model and run an execution, it will create a new database table and index for the new embeddings.
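You can also start an execution programmatically; a minimal sketch with the AWS SDK for JavaScript v3 (the state machine ARN below is a placeholder for the one deployed by the corpus stack):

import { SFNClient, StartExecutionCommand } from "@aws-sdk/client-sfn";

const sfn = new SFNClient({});

// Start the indexing pipeline with the default (empty) payload;
// the workflow will auto-detect new/modified files.
await sfn.send(
  new StartExecutionCommand({
    stateMachineArn: "arn:aws:states:us-east-1:123456789012:stateMachine:CorpusIndexingPipeline",
    input: JSON.stringify({}),
  }),
);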
How do I force BULK re-indexing?
If you want to force re-indexing everything, use the following payload.
{
  "IndexingStrategy": "BULK",
  "SubsequentExecutionDelay": 0,
  "ModifiedSince": 1
}
How to test indexing on a small sample set?
You can limit the number of files that are processed with the following payload.
{
  "MaxInputFilesToProcess": 1000,
  "IndexingStrategy": "MODIFIED",
  "SubsequentExecutionDelay": 0,
  "TargetContainerFilesCount": 500,
  "ModifiedSince": 1
}
This processes the first 1000 files, split across 2 containers (500 files per container).
How do I override environment variables?
You can override environment variables from the payload as well (make sure all keys and values are strings):
{
  "IndexingStrategy": "BULK",
  "SubsequentExecutionDelay": 0,
  "ModifiedSince": 1,
  "Environment": { "CHUNK_SIZE": "2000", "CHUNK_OVERLAP": "400" }
}
Bulk re-indexing with a custom chunk size.
How to reset vector store data?
The state machine can be executed with the following payload to delete the database table data (TRUNCATE), reset the index, and force bulk re-indexing.
{
  "VectorStoreManagement": {
    "PurgeData": true,
    "CreateIndexes": false
  }
}