July 5, 2026 · Bailey Klinger

A/B Testing Update!

We recently completed a round of pre-registered A/B tests on MAIA's proactive coaching touches, using Evidential, the open-source experimentation platform built by IDinsight and The Agency Fund. Every test was registered before launch with a primary metric, a minimum window, and a decision rule. Several ended in null results, which we also share here because null findings are often the most useful.

What we tested, and what happened

Re-engagement hooks work. Our biggest completed test asked how to reach users who have gone quiet. The old default was the standard weekly tip customized for their business. We tested it against two interactive one-tap "hooks": one that names a common business problem and offers help, and one that offers to build something concrete for the user. Across roughly 4,300 sends, both hooks beat the tip on positive engagement (a positive button tap or a real written reply): 8.3% for the passive tip versus 10.5% and 11.1% for the two hooks, both statistically significant. Just as important for a WhatsApp service, the hooks cut opt-outs to roughly a third of the passive tip's rate. Asking a silent user a genuine question, with a one-tap way to say yes, is more respectful of their attention than broadcasting advice at them. We retired passive tips for these users and shipped both hooks in rotation.
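
For readers who want to sanity-check a comparison like this, here is a minimal sketch of a pooled two-proportion z-test in Python. It is not the analysis Evidential ran. The per-arm counts are not reported above, so the equal three-way split of the roughly 4,300 sends (`N_PER_ARM`) is an assumption, and the p-values depend on the true arm sizes.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled rate under the null hypothesis that both arms share one rate.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# ASSUMPTION: sends were split evenly across the three arms.
N_PER_ARM = 4300 // 3

# Counts reconstructed from the reported rates (8.3%, 10.5%, 11.1%).
tip = round(0.083 * N_PER_ARM)
hook_problem = round(0.105 * N_PER_ARM)
hook_build = round(0.111 * N_PER_ARM)

for name, hook in [("problem hook", hook_problem), ("build hook", hook_build)]:
    z, p = two_proportion_z_test(tip, N_PER_ARM, hook, N_PER_ARM)
    print(f"{name} vs tip: z = {z:.2f}, p = {p:.3f}")
```

With arms of this size, a gap of two to three percentage points sits close to the edge of what a test can detect, which is why the sample size matters as much as the rates.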

Personalized "memory" did not. The obvious next step was tighter personalization: instead of us selecting the offer, anchor it in something the user told us months ago ("a while back you told me about X, so I prepared Y"). This is the kind of feature that demos beautifully. In a three-arm test (about 2,800 sends), the memory-anchored offer performed about the same as the MAIA-selected offer: 16.8% versus 16.5% positive engagement, a difference well within noise. The personalization machinery recalled a real profile fact on 98% of sends, so it worked as designed, but it provided no lift over MAIA selecting the offer itself, based on its own analysis of the user profile and diagnostic.
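
To make "well within noise" concrete, the sketch below computes a simple 95% Wald confidence interval for the difference between two engagement rates. The per-arm sample size is an assumption (an even split of about 2,800 sends across three arms), and the counts are reconstructed from the reported percentages, so treat the interval as illustrative only.

```python
import math


def diff_confidence_interval(successes_a, n_a, successes_b, n_b, z=1.96):
    """Wald confidence interval for the difference in rates (a minus b)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Unpooled standard error of the difference between two proportions.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_a - p_b
    return diff - z * se, diff + z * se


# ASSUMPTION: about 2,800 sends split evenly across three arms.
N_PER_ARM = 2800 // 3

memory = round(0.168 * N_PER_ARM)    # memory-anchored offer
selected = round(0.165 * N_PER_ARM)  # MAIA-selected offer

low, high = diff_confidence_interval(memory, N_PER_ARM, selected, N_PER_ARM)
print(f"95% CI for the difference: {low:+.3f} to {high:+.3f}")
```

Under these assumptions the interval spans several percentage points on either side of zero, so a 0.3-point gap cannot be told apart from no effect at all.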

Check-in "flavors" did not matter. We also tested two variations of our 23-hour check-in message: one that delivers the first slice of the promised work inline instead of merely offering it, and one that acknowledges the user's engagement streak. Across roughly 3,800 check-ins, neither beat the plain version.

The pattern across all of it

Putting the three finished tests together, the general lesson is that message mechanics do move replies, and better hooks win the tap, but the marginal effects are small: across three experiments, our best message formats landed within about a percentage point of each other. Accumulating small improvements to these metrics is helpful, but each test leaves less juice to squeeze out of message framing, and the messaging lever in MAIA is close to saturated. Engagement was never the goal. It is only the first step of a coaching loop that is supposed to end with an entrepreneur doing something different in their business. The gap between attention and execution is exactly the adoption gap MAIA exists to close. So while we ran this round of message-mechanics tests on response rates, we were also building the back-end machinery that lets the next round of tests move upstream.

What is in the field now

The new primary outcome is not response rate or engagement. It is verified execution of the discussed practice or improvement within 28 days, measured by REX, our behavioral measure that scans conversations for evidence that a recommended change actually happened. We built REX to target business improvement rather than engagement. In our data, REX predicts improvements in business financials out of sample with an AUC of about 0.7.
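
For readers less familiar with AUC: it is the probability that a randomly chosen case with the outcome gets a higher score than a randomly chosen case without it, so 0.5 is chance and 1.0 is perfect ranking. The sketch below computes it directly from that definition. The scores and labels are made-up toy values, not REX output, and the function name is hypothetical.

```python
def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This is O(n * m), which is fine for
    illustration but slow for large datasets.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    correct = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                correct += 1.0
            elif pos == neg:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# HYPOTHETICAL toy data: a behavioral score per business, and whether that
# business later showed a financial improvement (1) or not (0).
toy_scores = [0.9, 0.8, 0.3, 0.2]
toy_labels = [1, 0, 1, 0]
print(auc(toy_scores, toy_labels))
```

An AUC near 0.7 therefore means that in roughly seven of ten such pairs, the business that went on to improve had the higher score.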

Two tests use this target.

What MAIA recommends. We are making increased use of peer data within MAIA, now that we have critical mass in our core operating countries. The coaching engine now ranks the practices a specific business has not yet adopted by their association with profits among similar businesses using MAIA, and proposes the highest-value gap. The test separates the two ingredients of this idea. In one arm, the peer data works silently: the practice is targeted this way but presented to the user like any other coaching suggestion, testing whether proposing the right practice is what moves execution. In a second arm, MAIA proposes the same targeted practice but also shows the user the peer comparison behind it (the most advanced businesses like yours do this, and you do not yet), testing whether that social proof adds motivation on top of correct targeting.
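
The targeting idea can be sketched in a few lines. This is an illustration under stated assumptions, not MAIA's engine: "association with profits" is stood in for by a plain difference in mean profit between peers who use a practice and peers who do not, the peer group is taken as given, and all names and numbers are hypothetical.

```python
from statistics import mean


def rank_practice_gaps(peers, adopted):
    """Rank practices the business lacks by their profit gap among peers.

    peers: list of (set_of_practices, profit) for similar businesses.
    adopted: set of practices the target business already uses.
    Returns a list of (practice, gap) sorted from largest gap to smallest.
    """
    all_practices = set().union(*(practices for practices, _ in peers))
    scored = []
    for practice in all_practices - adopted:
        with_it = [profit for practices, profit in peers if practice in practices]
        without_it = [profit for practices, profit in peers if practice not in practices]
        if not with_it or not without_it:
            continue  # no comparison possible for this practice
        # ASSUMPTION: a raw difference in means stands in for "association".
        scored.append((mean(with_it) - mean(without_it), practice))
    scored.sort(reverse=True)
    return [(practice, gap) for gap, practice in scored]


# HYPOTHETICAL peer data: practices in use and monthly profit.
peers = [
    ({"bookkeeping", "inventory"}, 500),
    ({"bookkeeping"}, 400),
    ({"inventory", "marketing"}, 300),
    ({"marketing"}, 200),
    (set(), 100),
]
ranked = rank_practice_gaps(peers, adopted={"marketing"})
print(ranked[0])  # the highest-value gap to propose
```

A raw difference in means is only an association and can reflect which businesses choose a practice rather than what the practice does, which is one reason the effect on execution has to be tested experimentally.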

How MAIA coaches. The largest test moves the experiment into the core reply prompt itself. Does leading every business exchange with the smallest finished piece of useful work (then going deeper on request) beat the classic diagnose-then-advise coaching shape? And does adding an ownership layer, where the user states their own next step and when they will do it, produce more follow-through than plans that MAIA proposes and the user politely ratifies? As with the first test, the outcome is whether users actually implement profit-increasing changes, not engagement.

The bottom line

The two key shifts in our A/B testing: 1) moving away from behavioral-economics-style changes to wording and framing, toward deeper tests of what we coach on and of the tradeoff between diagnostic depth and delivering useful assets and finalized plans; and 2) moving our testing closer and closer to the key outcome that we and our users are here for: business improvements.