Hallucination risk is changing. Quickly.

May 22, 2026·By Bailey Klinger

MAIA has been running on WhatsApp for Peruvian, Panamanian, Colombian and other Latin American micro-entrepreneurs and smallholder farmers for two and a half years. Over that period, the hallucination problem has been a central element of the work, and it has changed in very interesting ways. We've observed three general shifts in this period, each requiring a different response. It really shows how important constant monitoring and improvement are when using a rapidly evolving tool like AI.

Phase one: fixing data distribution (2023 to early 2024)

When MAIA launched, the frontier models of the day were good at some things, but not as good at what our users needed. A small store owner in Cusco asking how to register with SUNAT, or a coffee farmer in Huila asking about a specific fertilizer subsidy, would routinely get answers from generic AI that were written for Silicon Valley entrepreneurs, with made-up details to fill the gaps. This was simply a data distribution problem. The open web that trained these models is overwhelmingly North American and European. The specific procedural and regulatory knowledge our users care about lives in PDFs on ministry websites, in trainer manuals, and in the heads of local extension officers, and it is not well represented in pre-training.

The fix was country-specific retrieval-augmented generation (RAG). We built knowledge bases of the actual documents our users were asking about: curriculum content, ministry guidance, local regulatory texts, and partner-supplied training material. We forced the model to answer from retrieved chunks rather than from its priors. Where the retrieval came up empty, we had the system say so rather than improvise.
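A minimal sketch of this retrieve-or-admit pattern, in Python. It is an illustration under stated assumptions, not MAIA's actual pipeline: the `Chunk` type, the keyword-overlap scoring (a stand-in for a real embedding search), the `min_overlap` threshold and the `call_model` callable are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical wording for the "say so rather than improvise" path.
FALLBACK = "I don't have a reliable source on that yet, so I'd rather not guess."


@dataclass
class Chunk:
    country: str  # e.g. "PE" or "CO"; keeps retrieval country-specific
    source: str   # document the text came from
    text: str


def _tokens(text):
    # Crude tokenizer: lowercase, strip punctuation, drop short words.
    words = (w.strip(".,;:?!()").lower() for w in text.split())
    return {w for w in words if len(w) > 4}


def retrieve(question, country, knowledge_base, min_overlap=2, top_k=3):
    """Return the best-matching chunks for one country, or an empty list."""
    q = _tokens(question)
    scored = []
    for chunk in knowledge_base:
        if chunk.country != country:
            continue
        overlap = len(q & _tokens(chunk.text))
        if overlap >= min_overlap:
            scored.append((overlap, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


def answer(question, country, knowledge_base, call_model):
    """Answer only from retrieved chunks; admit the gap when there are none."""
    chunks = retrieve(question, country, knowledge_base)
    if not chunks:
        return FALLBACK  # the model is never called, so it cannot improvise
    context = "\n\n".join(f"[{c.source}] {c.text}" for c in chunks)
    prompt = (
        "Answer using ONLY the sources below. "
        "If they do not cover the question, say you do not know.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    return call_model(prompt)
```

The point of the design is that the empty-retrieval branch is decided in code, before the model sees the question, rather than left to the model's own judgement.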

That worked well. It was a lot of work to build and maintain, but it dropped the hallucination rate on regionally specific questions to something acceptable for a coaching context.

Phase two: the chatbot sweet spot (mid 2024 to mid 2025)

But over time the frontier moved. GPT-4-class models, then the next generation, then the one after that, all got dramatically better at factuality. This can be seen in our data and in third-party benchmarks. For example, the Vectara hallucination leaderboard started showing top models below 5% on grounded summarisation tasks. Anthropic's CEO went on record at a developer event suggesting frontier models hallucinate less than humans on some factual tasks. For us, this period was the easiest stretch in terms of hallucinations. The same RAG pipeline that took heroic effort to build in 2023 was now backstopped by a model that, even without retrieval, was much less likely to invent a Peruvian ministry that does not exist or a tax regime that was repealed three years ago. All the neat little prompting tricks that used to be important (e.g. 'you are an expert in X') were simply built into the models by the providers themselves and didn't matter much anymore. This allowed us to focus on other things, like proactive engagement and nailing the coaching loop. Life was good.

Phase three: the overconfidence era (late 2025 to now)

In the last several months something has shifted. Model improvement is accelerating and aggregate hallucination benchmarks are improving. But these headline numbers are not what production deployments are showing. What we have observed, and what independent researchers are now documenting, is a change in the character of the hallucinations. Models follow complex multi-step instructions far better than they did 18 months ago. They produce structured outputs more reliably. They are excellent at coding and at chaining tool calls. But when they are wrong, they are wrong with more conviction and less hedging than the previous generation. They refuse less. They volunteer fewer "I am not sure about this" caveats. The same training pressures that make a model good at agentic coding, where the right move is to commit to an action and execute, appear to be reducing the model's willingness to express uncertainty in factual responses, making them in some ways worse hallucinators than before.

This is not just our impression. OpenAI's own evaluations show that o3 and o4-mini, their reasoning-focused models, hallucinate more than their predecessors o1 and o3-mini, despite being smarter on benchmarks. Independent analysis attributes this to reduced refusal rates: the newer models answer questions they should have declined, and when they answer, they sound just as confident as when they actually know. Comparative benchmarks released in early 2026 show one major frontier model hallucinating at over 80% on the hardest factual recall tests, while still leading the field on raw capability. Another lab's flagship hallucinates less often but produces confident fabrications when it does fail, making the errors harder to spot than the more obviously confused output of older models.

Model providers are post-training their flagships for the use cases that pay the bills, which are coding assistants and agentic workflows. Both reward decisiveness and instruction-following. Neither rewards epistemic humility about an emerging market's agricultural regulations. The optimisation pressure is pointing away from the behaviour we need.

What this changed in our engineering

The implication for us was concrete: internal tests of new frontier models started showing worse rather than better coaching performance unless we made proper adaptations to everything else around the model (prompt, RAG, programming). The marginal hallucination that gets through to a user today is more likely to sound right than the ones we caught in 2023, which means downstream checks matter more.
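One way an internal test of this kind could look is a small regression gate run before switching models: ask both models questions that the knowledge base cannot answer and compare how often each one admits it. This is a hedged sketch, not our test suite; the marker phrases, the `tolerance` value and the model callables are assumptions, and string matching is only a rough proxy for real abstention.

```python
# Hypothetical phrases that signal the model is admitting uncertainty.
UNCERTAINTY_MARKERS = (
    "not sure",
    "not certain",
    "don't have",
    "do not have",
    "cannot confirm",
)


def abstains(reply):
    """True if the reply contains at least one uncertainty marker."""
    text = reply.lower()
    return any(marker in text for marker in UNCERTAINTY_MARKERS)


def abstention_rate(model, unanswerable_questions):
    """Share of unanswerable questions on which the model declines to guess."""
    replies = [model(q) for q in unanswerable_questions]
    return sum(abstains(r) for r in replies) / len(replies)


def safe_to_upgrade(current, candidate, questions, tolerance=0.05):
    """Block the upgrade if the candidate abstains noticeably less often."""
    baseline = abstention_rate(current, questions)
    return abstention_rate(candidate, questions) >= baseline - tolerance
```

A candidate that answers everything confidently fails this gate even if its aggregate benchmark scores are higher, which is the failure pattern described above.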

Over the past several months we have added a layer of programmatic checks to MAIA that sits between the model and the user. These are not glamorous and they are not novel research. They are the boring engineering equivalent of a second (third, fourth) pair of eyes, and they are not just LLM-based checks but also old-school programmatic protections. For categories of claim where we know there are risky failure modes, we force validations and check them. Where validation fails, the system either rewrites with stronger grounding or falls back to a safe answer that acknowledges the limit of its knowledge. Those acknowledgements used to come from the models themselves, but that is less often the case with new models. This is in line with where serious production AI is heading. The 2026 consensus, visible across enterprise guardrail vendors and academic work on hallucination detection, is that grounding plus runtime validation is now the standard architecture, not a paranoid extra. The era when you could ship a wrapper around a frontier model and trust it on facts is over.
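To make the shape of such a layer concrete, here is a minimal Python sketch under stated assumptions. It is not MAIA's code: the two checks (numbers must appear in the retrieved context, institution acronyms must be on a known list), the regexes, the `rewrite` callable and the fallback wording are all hypothetical examples of "old-school" validation.

```python
import re

# Hypothetical safe answer that acknowledges the limit of the system's knowledge.
SAFE_FALLBACK = (
    "I am not certain about the details here. "
    "Please confirm with the official source before acting."
)

NUMBER = re.compile(r"\d+(?:[.,]\d+)?%?")  # 18, 18%, 1,500, 3.5%
ACRONYM = re.compile(r"\b[A-Z]{3,}\b")     # SUNAT, DIAN, ...


def validate(draft, context, known_institutions):
    """Return a list of problems; an empty list means the draft passes."""
    problems = []
    grounded = set(NUMBER.findall(context))
    # Strict on purpose: a figure must appear verbatim in the retrieved text.
    for number in NUMBER.findall(draft):
        if number not in grounded:
            problems.append(f"number not in sources: {number}")
    for name in sorted(set(ACRONYM.findall(draft)) - set(known_institutions)):
        problems.append(f"unknown institution: {name}")
    return problems


def guarded_reply(draft, context, known_institutions, rewrite=None):
    """Pass the draft through, try one grounded rewrite, or fall back."""
    problems = validate(draft, context, known_institutions)
    if not problems:
        return draft
    if rewrite is not None:
        # `rewrite` is an assumed callable that re-asks the model with the
        # list of problems and stronger grounding instructions.
        second = rewrite(draft, problems)
        if not validate(second, context, known_institutions):
            return second
    return SAFE_FALLBACK
```

Checks like these are deliberately dumb. They will flag some correct answers, but they do not get more confident when the model does, which is what makes them useful as a backstop.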

What this means for you

Building an AI business coach, or any other AI tool, means aiming at a rapidly moving target. Old models are deprecated, and new models offer significant potential improvements for your tool, but they can really hurt performance if not incorporated carefully. This is particularly true for an AI coach, as the frontier labs all chase coders. Constant monitoring and adaptation are critical.