Testing & Evaluation Framework

By Bailey Klinger
Jan 04, 2025

Over the past 14 months, we have been intensively developing and testing this tool, rapidly iterating on changes and observing their results to ensure the tool is safe, fair, transparent, and accountable. To formalize parts of this process, we have created a Testing and Evaluation (T&E) Framework: a set of questions that are answered by both the current and the challenger version of the tool and formally evaluated. The question set was selected to test potential risks to safety and fairness, as well as advice quality. Some of these questions have a clear, factually correct answer (e.g., what are the qualification rules for SUNAT’s RUS tax regime for small businesses), while others do not (e.g., what are some strategies if my sales are declining). Responses to the latter are evaluated against best practices from the entrepreneurial training literature (McKenzie & Woodruff 2003).
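The two question types described above could be represented along the following lines; this is a hypothetical sketch, and the field names and evaluation criteria strings are illustrative, not the framework's actual schema:

```python
# Hypothetical question-set entries: "factual" questions are graded
# against a known correct answer, while "open-ended" questions are
# graded against best practices from the training literature.
questions = [
    {
        "question": "What are the qualification rules for SUNAT's RUS tax regime?",
        "type": "factual",
        "criteria": "matches the actual SUNAT qualification rules",
    },
    {
        "question": "What are some strategies if my sales are declining?",
        "type": "open-ended",
        "criteria": "aligns with entrepreneurial training best practices",
    },
]
```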

This T&E process is run on the following schedule:

1) When there is a change to the foundation model (e.g., if a GPT-5 were released), the new version becomes the challenger to the previous version.

2) When there are changes to model instructions or RAG resources (particularly in response to a safety or fairness issue identified by our monitoring processes or user feedback), the updated version becomes the challenger to the previous version.

3) If 6 months have passed without 1 or 2, the question set is re-run as challenger against that same model's responses from 6 months ago, to identify any drift in performance.
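The three triggers above can be sketched as a simple decision function. This is an illustrative assumption about how one might encode the schedule; the function name, inputs, and the ~182-day approximation of "6 months" are ours, not the team's actual tooling:

```python
# Sketch of the three T&E re-run triggers, checked in priority order.
from datetime import date, timedelta
from typing import Optional

def needs_rerun(foundation_model_changed: bool,
                instructions_or_rag_changed: bool,
                last_run: date,
                today: date) -> Optional[str]:
    if foundation_model_changed:
        # Trigger 1: new foundation model challenges the previous version.
        return "foundation-model change"
    if instructions_or_rag_changed:
        # Trigger 2: updated instructions/RAG challenge the previous version.
        return "instruction/RAG change"
    if today - last_run >= timedelta(days=182):  # roughly 6 months
        # Trigger 3: drift check against the same model's old responses.
        return "drift check"
    return None  # no re-run needed yet
```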

This is how it works: each question in the question column is run through the champion and challenger versions of the tool, and the responses are then judged against the corresponding evaluation criteria. A challenger is adopted if and only if it scores higher on advice quality and no worse on the bias, hallucination, and intended-use checks. The question set will evolve over time, with updates posted here on the blog.
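The promotion rule just described can be written out explicitly. This is a minimal sketch under our own assumptions: the `Scores` structure and the convention that higher values are better on every metric are illustrative, not the framework's actual scoring format:

```python
# Champion/challenger promotion rule: the challenger wins only if it
# improves advice quality AND does not regress on any safety check.
from dataclasses import dataclass

@dataclass
class Scores:
    advice_quality: float  # higher is better
    bias: float            # higher is better (fewer biased responses)
    hallucination: float   # higher is better (fewer hallucinations)
    intended_use: float    # higher is better (stays within intended use)

def promote_challenger(champion: Scores, challenger: Scores) -> bool:
    no_regression = (
        challenger.bias >= champion.bias
        and challenger.hallucination >= champion.hallucination
        and challenger.intended_use >= champion.intended_use
    )
    return no_regression and challenger.advice_quality > champion.advice_quality
```

Note that a challenger with better advice quality is still rejected if it slips on even one safety metric, which encodes the "if and only if it does not score any worse" condition.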

In this implementation of the T&E framework, the champion is the current version of the tool (MAIA Bodega v4) and the challenger is GPT-3.5 Turbo with no system instructions, to illustrate the impact of our tool's current defences for responsible and safe AI: using a frontier model (GPT-4o), carefully designed model instructions (available on the blog), and detailed RAG resources (also available on the blog).

See the T&E framework here: Google Sheet