Model evaluations using a panel of LLM evaluators

FMBench release 2.0.0 adds support for evaluating candidate models using majority voting with a Panel of LLM Evaluators (PoLL). It gathers quantitative metrics such as Cosine Similarity and overall majority voting accuracy to measure how closely model responses match the ground truth.

Accuracy is defined as the percentage of responses generated by the LLM that match the ground truth included in the dataset (as a separate column). To determine whether an LLM generated response matches the ground truth, we ask other LLMs, called evaluator LLMs, to compare the LLM output against the ground truth and provide a verdict on whether the response is correct given the ground truth. Here is the link to the Anthropic Claude 3 Sonnet model prompt being used as an evaluator (or judge) model. A combination of the Cosine Similarity score and the LLM evaluator verdict decides whether the LLM generated response is correct or incorrect. Finally, because a single LLM evaluator could be biased or inaccurate, instead of relying on the judgement of a single evaluator we rely on the majority vote of 3 different LLM evaluators. By default we use the Anthropic Claude 3 Sonnet, Meta Llama3-70b and Cohere Command R Plus models as LLM evaluators. See Pat Verga et al., "Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models", arXiv:2404.18796, 2024, for more details on using a Panel of LLM Evaluators (PoLL).
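As an illustration, here is a minimal sketch (not the FMBench implementation) of how a panel verdict could be aggregated: each evaluator returns a binary verdict for a candidate response, and the panel's decision is the majority vote across the three evaluators. The evaluator identifiers and the `get_evaluator_verdict` helper are hypothetical placeholders for calls to the actual judge models.

```python
from collections import Counter
from typing import Callable

# Hypothetical evaluator identifiers; in FMBench these would correspond to the
# default judge models (Claude 3 Sonnet, Llama3-70b, Command R Plus).
EVALUATORS = ["claude-3-sonnet", "llama3-70b", "command-r-plus"]

def panel_majority_vote(
    question: str,
    candidate_response: str,
    ground_truth: str,
    get_evaluator_verdict: Callable[[str, str, str, str], bool],
) -> bool:
    """Return True if a majority of the panel judges the response correct.

    `get_evaluator_verdict` is an assumed helper that sends the question,
    candidate response, and ground truth to a single judge model and parses
    its binary correct/incorrect verdict.
    """
    verdicts = [
        get_evaluator_verdict(evaluator, question, candidate_response, ground_truth)
        for evaluator in EVALUATORS
    ]
    # Majority vote over the three binary verdicts (no ties with 3 evaluators).
    votes = Counter(verdicts)
    return votes[True] > votes[False]
```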

Evaluation Flow

  1. Provide a dataset that includes ground truth responses for each sample. FMBench uses the LongBench dataset by default.

  2. Configure the candidate models to be evaluated in the FMBench config file. See this config file for an example that runs evaluations for multiple models available via Amazon Bedrock. Running evaluations only requires the following two changes to the config file (a minimal sketch of these settings appears after this list):

    • Set 4_get_evaluations.ipynb: yes, see this line.
    • Set the ground_truth_col_key: answers and question_col_key: input parameters, see this line. The values of ground_truth_col_key and question_col_key are set to the names of the dataset columns that contain the ground truth and the question, respectively.
  3. Run FMBench, which will:

    • Fetch the inference results containing the model responses
    • Calculate quantitative metrics (Cosine Similarity)
    • Use a Panel of LLM Evaluators to compare each model response to the ground truth; each evaluator provides a binary verdict (correct/incorrect) and an explanation
    • Validate the LLM evaluations using Cosine Similarity thresholds
    • Categorize the final evaluation for each response as correctly correct, correctly incorrect, or needs further evaluation (see the sketch after this list)

  4. Review the FMBench report to analyze the evaluation results and compare the performance of the candidate models. The report contains tables and charts that provide insights into model accuracy.
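For reference, the snippet below is a minimal sketch of the two evaluation-related settings described in step 2. The surrounding section names (run_steps, datasets) are assumptions about the config file layout; refer to the example config file linked above for the exact structure.

```python
import yaml

# Hypothetical excerpt of an FMBench config file showing only the two
# evaluation-related changes; all other keys in the real config are omitted
# and the section names are assumed.
CONFIG_EXCERPT = """
run_steps:
  4_get_evaluations.ipynb: yes   # enable the evaluation notebook

datasets:
  ground_truth_col_key: answers  # dataset column holding the ground truth
  question_col_key: input        # dataset column holding the question
"""

config = yaml.safe_load(CONFIG_EXCERPT)
print(config["run_steps"]["4_get_evaluations.ipynb"])  # True
print(config["datasets"]["ground_truth_col_key"])      # 'answers'
```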
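The categorization in the last sub-step can be thought of as cross-checking the panel's majority vote against the Cosine Similarity score. The sketch below illustrates one way this could work; the threshold value, the `cosine_similarity` helper, and the agreement logic shown here are illustrative assumptions, not FMBench's actual values or code.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def categorize(majority_vote_correct: bool, similarity: float, threshold: float = 0.8) -> str:
    """Cross-check the panel verdict against a Cosine Similarity threshold.

    - Panel says correct and similarity is high  -> "correctly correct"
    - Panel says incorrect and similarity is low -> "correctly incorrect"
    - The two signals disagree                   -> "needs further evaluation"
    """
    if majority_vote_correct and similarity >= threshold:
        return "correctly correct"
    if not majority_vote_correct and similarity < threshold:
        return "correctly incorrect"
    return "needs further evaluation"

# Example with toy embeddings of a model response and the ground truth
# (in practice these would come from an embedding model).
response_emb = np.array([0.1, 0.7, 0.2])
ground_truth_emb = np.array([0.1, 0.6, 0.3])
sim = cosine_similarity(response_emb, ground_truth_emb)
print(categorize(majority_vote_correct=True, similarity=sim))  # correctly correct
```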

By leveraging ground truth data and a Panel of LLM Evaluators, FMBench provides a comprehensive and efficient way to assess the quality of generative AI models. The majority voting approach, combined with quantitative metrics, enables a robust evaluation that reduces bias and latency while maintaining consistency across responses.