Reading Impact From the Conversation: Predicting Whether MAIA Worked, Without Asking
We face a typical problem with our own internal impact surveys (which are distinct from external impact evaluations): the people who answer an internal survey might be special and fail to represent the silent majority of non-respondents. If non-respondents' outcomes differ systematically from respondents', our internal impact metrics are biased. The bias could run in either direction. It could be that only successful power users bother to respond, pushing the results upward. Or, because many of our users receive the tool through a government agency like Tu Empresa or AMPYME, agencies that also provide financial support to struggling businesses, users may see the survey as an opportunity to get financial assistance and push the results downward.
For MAIA, this matters a great deal. We coach small business owners over months. Our outcome surveys at 60 and 90 days ask whether their costs went down and their sales went up. The active users who respond give us valuable signal, but they are a minority of our users, and they give this information at only a few points in time. Are they representative of non-respondents? And how does impact evolve beyond the moments the surveys capture?
Over the past few months we built and tested an answer: a model that predicts how a user would have answered the impact surveys, based on their MAIA conversations and the resulting profile and diagnostic. At the cohort level, its predictions match actual survey outcomes within a few percentage points, and its AUC, a metric familiar from credit scoring, is 0.83. Importantly, all of the signals were derived blind to the outcome variable. We then examined those same conversation features across the entire user base and compared respondents to non-respondents. On every feature we can measure, they look the same: the respondents are not a biased subset.
What MAIA already knows about a user
Every active MAIA user has a structured profile that the system rebuilds periodically from their conversation history. The profile captures things like the user's business sector and stage, their stated goals, their current operational practices, where MAIA assesses them on each of the eight capability pillars (customer and market, offer quality, sales channels, operations and supply, finance and records, costing and pricing, people and productivity, planning and risk), and how those assessments have changed over time.
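To make that concrete, here's a minimal sketch of what such a profile might look like as a data structure. The field names and level encoding are illustrative, not MAIA's actual schema.

```python
# Illustrative sketch of a structured user profile (not MAIA's real schema).
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List

PILLARS = [
    "customer_and_market", "offer_quality", "sales_channels",
    "operations_and_supply", "finance_and_records", "costing_and_pricing",
    "people_and_productivity", "planning_and_risk",
]

@dataclass
class PillarSnapshot:
    as_of: date
    # e.g. "Nascent" < "Developing" < "Proficient", encoded as 0 / 1 / 2
    levels: Dict[str, int]

@dataclass
class UserProfile:
    user_id: str
    sector: str                      # e.g. "food retail"
    stage: str                       # e.g. "early", "established"
    stated_goals: List[str]
    cost_tracking_adopted: bool      # confirmed from conversation evidence
    pillar_history: List[PillarSnapshot] = field(default_factory=list)
```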
We hypothesized that some of these structured fields might predict survey outcomes. To test this, we trained a predictor on the users who had answered the surveys, then evaluated which signals carried the most weight. The model itself is deliberately simple: the specific model family matters less than the features, and we wanted something we could inspect and reason about rather than a black box.
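As a rough illustration of what "deliberately simple" means here, a logistic regression over a handful of standardized profile features would fit the bill. The feature names below are placeholders for the signals described in the following paragraphs, not our exact feature set.

```python
# A minimal sketch of a simple, inspectable predictor. Feature names are
# illustrative; the outcome is the survey's reported-impact flag.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

FEATURES = [
    "cost_tracking_adopted",   # 0/1 structured tag from the profile
    "pillar_slope",            # average change in pillar levels per rebuild
    "n_pillar_regressions",    # count of downgrades between rebuilds
    "n_action_reports",        # past-tense execution evidence in conversation
]

def fit_predictor(X: np.ndarray, y: np.ndarray):
    """X: respondents-only feature matrix; y: 1 if impact was reported."""
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X, y)
    # Coefficients on standardized features are directly comparable, so we can
    # read off which signals carry the most weight.
    coefs = dict(zip(FEATURES, model.named_steps["logisticregression"].coef_[0]))
    return model, coefs
```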
The signals that turned out to predict outcomes most strongly were exactly the ones with the most intuitive relationship to actual business performance:
Whether the user has adopted a cost-tracking template. This is a structured tag in the profile. It captures whether MAIA has confirmed, from conversation evidence, that the user is using a recurring method to track their unit costs. The intuition is direct: you can't manage what you don't measure. Users who adopt a cost-tracking habit are dramatically more likely to report substantial cost reductions.
The trajectory of business sophistication over time. MAIA's diagnostic rates each user across the eight capability pillars. The static assessment matters somewhat, but the *direction of movement* matters more. Users whose pillars are improving over months tend to be the users showing real outcomes. The improving slope is itself a signal of an engaged, learning operator.
Pillar regression, the warning sign. When MAIA's assessment of a user's capability *decreases* on a pillar between rebuilds (for example, from "Proficient" back to "Developing"), it's usually because something concrete went wrong: a habit lapsed, a process broke, something didn't get re-implemented. Regressions are rare but informative. Users with multiple regressions over their lifecycle are systematically less likely to report impact.
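Here's one way the trajectory and regression signals could be derived from the profile's pillar history. It reuses the illustrative snapshot structure from the earlier sketch and is not our production code.

```python
# Derive trajectory and regression features from an ordered pillar history.
def pillar_features(history):
    """history: list of PillarSnapshot, ordered oldest to newest."""
    if len(history) < 2:
        return {"pillar_slope": 0.0, "n_pillar_regressions": 0}
    slope_sum, regressions, steps = 0.0, 0, 0
    for prev, curr in zip(history, history[1:]):
        for pillar, level in curr.levels.items():
            delta = level - prev.levels.get(pillar, level)
            slope_sum += delta
            steps += 1
            if delta < 0:              # e.g. "Proficient" back to "Developing"
                regressions += 1
    return {
        "pillar_slope": slope_sum / max(steps, 1),  # avg movement per pillar per rebuild
        "n_pillar_regressions": regressions,
    }
```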
Whether the conversation contains evidence of completed actions. This one we extract from the conversation itself, not just structured fields. When a user reports, in their own past-tense words, that they did something MAIA proposed ("subí el precio a S/12" — I raised the price to S/12; "ya no compro de ese proveedor" — I no longer buy from that supplier; "ahorré como S/200 esta semana" — I saved about S/200 this week), it's the strongest single predictor. Adoption is the whole game, and reported execution is the visible signal of adoption.
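As a toy illustration only: a crude cue-matching pass over user messages can stand in for this extraction step. Our actual extraction is richer than a keyword list, and the cues below are invented examples.

```python
# Toy stand-in for the action-report extraction: flag user messages containing
# first-person, past-tense execution evidence. The cue list is illustrative.
import re

EXECUTION_CUES = [
    r"\bsub[ií] el precio\b",      # "I raised the price"
    r"\bya no compro\b",           # "I no longer buy"
    r"\bahorr[eé]\b",              # "I saved"
    r"\bimplement[eé]\b",          # "I implemented"
]

def count_action_reports(messages: list[str]) -> int:
    pattern = re.compile("|".join(EXECUTION_CUES), flags=re.IGNORECASE)
    return sum(1 for m in messages if pattern.search(m))
```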
There are other signals in the model, and the exact weighting matters less than the picture they paint together: a user who tracks costs, whose capabilities are growing, whose progress isn't regressing, and who reports concrete actions taken. That's a user who's likely to report impact when surveyed. None of this is mysterious. The signals work because they describe the underlying mechanisms by which a small business actually improves.
The validation step
First, calibration on respondents. For users who *did* answer the surveys, we compared their predicted outcome distribution (with features derived blind to the outcome variable) to their actual responses. The cohort-level predicted impact rate matched the cohort-level actual impact rate within a few percentage points, with an AUC of 0.83.
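In code, this check amounts to comparing the mean predicted probability against the observed impact rate and computing the AUC on held-out respondents. A minimal sketch, assuming the predictor from the earlier example:

```python
# Respondent-side checks: cohort-level calibration plus AUC.
import numpy as np
from sklearn.metrics import roc_auc_score

def respondent_checks(model, X_resp: np.ndarray, y_resp: np.ndarray):
    p = model.predict_proba(X_resp)[:, 1]
    return {
        "predicted_impact_rate": float(p.mean()),    # cohort-level prediction
        "actual_impact_rate": float(y_resp.mean()),  # cohort-level survey outcome
        "auc": float(roc_auc_score(y_resp, p)),      # discrimination (0.83 reported above)
    }
```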
Second, and more importantly, a feature-distribution comparison on non-respondents. The crucial question was whether non-respondents look similar to respondents on those key input features. If non-respondents systematically had lower cost-tracking adoption, fewer action reports, and more pillar regressions, then our surveys would be capturing only the happy customers. If they don't, silence isn't correlated with poor outcomes.
The distributions overlapped closely. The non-responding active users look, by every signal we can measure, much like the responding active users. This is reassuring, but it comes with a limit: the check only covers the features we have. If non-respondents differ on something we don't observe, we can't see it from this analysis alone.
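A sketch of how such a distribution comparison might be run, feature by feature. The choice of a two-sample Kolmogorov–Smirnov test here is an assumption for illustration; any reasonable distributional comparison would serve.

```python
# Compare key feature distributions between respondents and non-respondents.
import numpy as np
from scipy.stats import ks_2samp

def compare_groups(X_resp: np.ndarray, X_nonresp: np.ndarray, feature_names):
    report = {}
    for j, name in enumerate(feature_names):
        stat, pval = ks_2samp(X_resp[:, j], X_nonresp[:, j])
        report[name] = {
            "respondent_mean": float(X_resp[:, j].mean()),
            "nonrespondent_mean": float(X_nonresp[:, j].mean()),
            "ks_stat": float(stat),
            "p_value": float(pval),
        }
    return report
```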
What this enables
A few useful things follow from having a working predictor.
Population-scale impact monitoring. We no longer need to wait for survey responses to know whether a recent product change moved the needle. As a preliminary check, we can score every active user weekly and track the predicted impact distribution over time. Surveys remain the ground truth at 60 and 90 days, but the predictor gives us a leading indicator on a much larger N.
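A sketch of what that weekly scoring might look like, assuming the predictor and a feature-builder function from the earlier examples; the aggregation choices are illustrative.

```python
# Weekly scoring loop: score every active user, summarize the predicted
# impact distribution, and track it over time.
from datetime import date

def weekly_snapshot(model, active_users, build_features):
    """active_users: iterable of profiles; build_features: profile -> feature vector."""
    scores = [model.predict_proba([build_features(u)])[0, 1] for u in active_users]
    n = max(len(scores), 1)
    return {
        "week": date.today().isoformat(),
        "n_users": len(scores),
        "mean_predicted_impact": sum(scores) / n,
        "share_above_0_5": sum(s > 0.5 for s in scores) / n,
    }
```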
Earlier and broader experimentation. When we test a prompt change or a new coaching mode, we can score the user population both before and after to see if the predicted distribution shifts. We don't need a fully powered RCT to see whether a change is plausibly working. The predictor offers a fast, cheap signal that we can validate later with surveys.
Honest segment reporting. We can break down predicted impact by country, sector, acquisition source, or any other axis without losing statistical power to small response counts. Partners in Panama, Peru, and Colombia each get a meaningful read of how MAIA is performing for users in their region without us having to collect hundreds of survey responses per segment.
Catching regression early. A user whose predicted impact score drops over time (capability pillars regressing, no execution evidence, no template adoption) gets flagged in our internal review. We can look at what changed in their coaching and either intervene or learn.
The thing we're not claiming
We are not claiming the predictor *is* impact. It's a model of how observable conversational and profile signals correlate with self-reported business outcomes. Because the predictor is trained against those self-reports, if the surveys are themselves biased in some way the conversation signals don't reveal, the model inherits that bias. That's why our true north will always be external impact evaluations with a clear identification strategy.
We're also not claiming this is the right approach for everyone. Profile-based prediction works for MAIA because we have a structured profile that captures meaningful business state. A simpler chatbot without that profile wouldn't have the inputs to do this. The general point, *triangulate self-report against signals from the conversation itself*, is portable. The specific implementation isn't.
Where this goes next
A few open questions we're working on:
- Are the predictive signals stable across cohorts and time? Our validation was on a recent cohort. We need to re-validate quarterly as the user base evolves and as we change the product.
- How much does the predictor improve when we add longer-horizon outcome data? Right now we validate against 60 and 90-day surveys. As we accumulate impact data at 6 months and 12 months, we expect the picture to deepen, particularly on the sales side, where we hypothesize wins compound over time.
- Can we predict the *type* of impact (cost vs sales) and not just whether impact occurred? The two paths look mechanistically different, and the predictive signals may differ between them.
The measurement problem we opened with isn't unique to MAIA. Any coaching service that interacts with users over time, collects partial survey responses, and wants to know its true impact distribution is facing the same question. We've put a first answer on the table for our case. If you're working on something similar, we'd love to compare notes.
