Evaluating Maia

May 12, 2025
By Bailey Klinger

Why evaluation matters for Maia

Every week, hundreds of small business owners use Maia to improve their business. But Maia is not a finished tool; we are constantly tinkering and experimenting to make it better. To keep improving, and to reassure partners and users that the tool is both safe and effective, we need a systematic way to test whether Maia is doing what we think it's doing. Over the past month we have posted information about our testing and evaluation framework, take-up patterns, and experiment results.

A timely reference point to tie all this together is the "AI Evaluation Framework for the Development Sector," published by the Center for Global Development (CGD) and written by leaders from the Agency Fund. The authors outline four incremental levels of evaluation, from technical correctness all the way to real-world impact. Importantly, this framework adds sorely needed nuance to development's previous default of simply 'intervention + RCT', and highlights the importance of asking the right questions at the right stage.

Below, we map Maia's current (and planned) evaluation activities to each level.

Level 1 - Model Evaluation

Does Maia behave the way we want it to? If the tool can't accurately give users information about their tax deadlines or best practices in sales and marketing, there is no need to wait for an RCT. Moreover, we need to make frequent decisions about updating the prompt or switching to new LLMs as they are released. This is why we use an extensive testing and evaluation framework, partially automated and partially based on subjective response rating, to evaluate responses on a regular schedule as well as after any major update.
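As a rough illustration of the automated portion, here is a minimal sketch of what one evaluation pass could look like, assuming a JSON test set of prompts paired with facts the answer should contain. The get_maia_response and llm_grade functions are hypothetical stand-ins (in practice the grader would typically be a judge LLM), not Maia's actual harness:

```python
# Minimal sketch of an automated eval pass over a fixed test set.
# get_maia_response() and llm_grade() are hypothetical stand-ins for
# the production chat endpoint and a grading step; not Maia's code.
import json

def llm_grade(response: str, expected_facts: list[str]) -> float:
    # Hypothetical grader: a real pass would ask a judge LLM to score
    # coverage; simple substring matching serves as a stand-in here.
    hits = sum(1 for fact in expected_facts if fact.lower() in response.lower())
    return hits / len(expected_facts)

def run_eval(test_file: str, get_maia_response) -> float:
    with open(test_file) as f:
        cases = json.load(f)  # [{"prompt": ..., "expected_facts": [...]}, ...]
    scores = [llm_grade(get_maia_response(c["prompt"]), c["expected_facts"])
              for c in cases]
    return sum(scores) / len(scores)  # mean score across the test set
```

Running a pass like this against every candidate prompt or model update yields scores that are comparable over time, and low-scoring responses can then be routed to the subjective human rating described above.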

Level 2 - Product Evaluation

Are users engaging with the tool and solving real-world problems? One important feature of our T&E process is manual review: I dedicate up to two hours every day to manually reviewing chat logs, evaluating all aspects of the product and identifying bugs and opportunities for improvement. In addition, we closely monitor usage patterns (Monthly Active Users and Weekly Active Users) and use these as early indicators so we can iterate on product improvements quickly.
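For concreteness, here is a minimal sketch of the MAU/WAU calculation, assuming an event log with one user_id and timestamp per message; the file and column names are illustrative, not our actual schema:

```python
# Minimal sketch of MAU/WAU from a chat event log. The CSV layout
# (user_id, timestamp columns) is an assumption for illustration.
import pandas as pd

def active_users(log: pd.DataFrame, as_of: pd.Timestamp, days: int) -> int:
    window = log[(log["timestamp"] > as_of - pd.Timedelta(days=days))
                 & (log["timestamp"] <= as_of)]
    return window["user_id"].nunique()  # each user counted once

log = pd.read_csv("chat_log.csv", parse_dates=["timestamp"])
today = pd.Timestamp.now()
wau = active_users(log, today, days=7)   # Weekly Active Users
mau = active_users(log, today, days=30)  # Monthly Active Users
print(f"WAU: {wau}  MAU: {mau}  stickiness: {wau / mau:.2f}")
```

Tracking the WAU/MAU ratio alongside the raw counts is a common way to distinguish one-off curiosity from habitual use.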

Level 3 - User Evaluation

Is the tool changing users' knowledge and behavior? First, we directly ask users if the tool is helping them and their business (see results here). We are also working in partnership with the ILO to test cognitive outcomes among small-scale coffee farmers using the tool, specifically surveying their knowledge of best practices for production, worker safety, and formalization. Finally, with our new and improved follow-up messages we will be able to measure how often users self-report implementing their action plans from the previous week, and study why that rate is higher for some users than others.

Level 4 - Independent Impact Evaluation

Our ultimate goal is for Maia to directly and causally improve business outcomes for users, so that is what we want to measure. I have published a quasi-experimental impact evaluation (regression discontinuity) of entrepreneurial training programs and have collaborated extensively with organizations like J-PAL and IPA on other work. Our goal is to build as good a product as we can using Levels 1-3, reach the necessary user volumes, and then collaborate on RCTs of the tool. We seek to follow in the footsteps of Otis et al. and go beyond the simple up-front question 'what is the impact of this AI tool?' to unpack the more complicated question of what types of AI assistance are useful (or harmful) for which kinds of entrepreneurs facing which situations.