Is AI getting dumber? We are used to most software getting better over time, as vendors fix bugs and introduce new or improved features. We hope that software, like a fine wine, will improve and develop pleasing complexity as it ages. However, this hoped-for progress appears to be eluding the large language models (LLMs) that power today’s generative AI tools, such as ChatGPT, Perplexity, Claude, Gemini and Llama.

The deterioration in AI performance became visible quite quickly after the general release of ChatGPT in November 2022. Researchers at Stanford and Berkeley tested out the new technology, as mentioned in this Scientific American article in August 2023. ChatGPT was given 500 integers and asked to identify which of these numbers were prime. In March 2023 the AI correctly identified the prime numbers almost 98% of the time. By June 2023 it was scoring under 3%. That was an early sign that LLMs were not following the conventional software pattern of steady improvement. It was generally assumed that large language models would get better as they scaled up, either with more training data or more processing power. This assumption has proved false. An article in New Scientist in September 2024 shows that across different classes of tests (arithmetic, anagrams, scientific challenges, pulling out information from lists, geographical questions) the reliability of LLMs actually worsened. The original study by three collaborating universities is here.
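By way of contrast, the task in that benchmark is trivial for conventional, deterministic software, which will score 100% on it today and 100% in a year’s time. Here is a minimal sketch of the underlying check (illustrative only; the study’s actual 500-number test set is not reproduced here):

```python
def is_prime(n: int) -> bool:
    """Deterministically check whether n is prime by trial division."""
    if n < 2:
        return False
    divisor = 2
    while divisor * divisor <= n:
        if n % divisor == 0:
            return False
        divisor += 1
    return True

# Pick out the primes from a list of integers, exactly as the benchmark asks.
numbers = [7, 10, 13, 24, 29, 97, 100]
print([n for n in numbers if is_prime(n)])  # [7, 13, 29, 97]
```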

Even a stable, deployed AI model may decline in performance over time (“model drift” or “model decay”) due to several factors: the external environment or user behaviour may change, the input data may shift, or the model may be confronted with data quite unlike anything it has seen before. This is not just a theoretical issue: a fraud detection model trained on historical fraud patterns may miss newer fraud techniques, and a sales model trained on historical sales data may fail to react to a new market trend. There are techniques for detecting model drift, but such monitoring needs to be put in place and routinely checked. How many organisations are actually doing this? How many are scrutinising the performance of their AI models over time, being alerted to issues and carrying out remediation where needed?
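One widely used detection technique is to compare the distribution of a model’s recent inputs (or outputs) against a reference window captured at training time, using a statistical test such as the two-sample Kolmogorov–Smirnov test. The sketch below is a minimal illustration, with synthetic data standing in for a real production feature and an alert threshold chosen purely for demonstration:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Reference window: the feature distribution the model was validated against.
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)

# Live window: recent production inputs; here we simulate a gradual shift.
live = rng.normal(loc=0.4, scale=1.2, size=5_000)

statistic, p_value = ks_2samp(reference, live)

# Illustrative alert threshold; a real deployment would tune this and would
# also track model accuracy against delayed ground-truth labels.
ALERT_THRESHOLD = 0.01
if p_value < ALERT_THRESHOLD:
    print(f"Possible drift: KS statistic={statistic:.3f}, p-value={p_value:.2e}")
else:
    print(f"No significant drift detected (p-value={p_value:.2e})")
```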

It has now been two and a half years since the release of ChatGPT, and recent developments are troubling. AI hallucinations are worsening in the latest models, as discussed in this New Scientist article from May 2025. OpenAI’s own report showed that hallucination rates have increased in its newest models, from 16% for the o1-mini model up to 48% for its most recent o4-mini model. The same thing has happened with DeepSeek, according to AI research firm Vectara.

Why is this? No one seems sure. Some theories are explored in this article, including stale training data. One idea is that AI has poisoned the well of its own training data. Before November 2022 almost all data on the internet was created by humans, but since the release of ChatGPT and its rivals there has been an explosion of AI-generated material. One article by Amazon Web Services reckoned that over half (57%) of all internet content is now AI-generated; much of it is machine-translated, with all the hallucinations and errors that implies. AI models have been shown to collapse quickly when trained on recursively generated data, as discussed in this article in Nature in July 2024. This affects not just LLMs but other types of AI models, as discussed in this TechTarget article from July 2023.
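To see intuitively why recursive training erodes quality, here is a deliberately tiny toy simulation (my own illustration, not the Nature study’s experiment). Each “generation” is a trivial categorical model fitted only to a finite sample drawn from the previous generation’s model; rare “tokens” that fail to appear in a sample vanish from the model for good, so diversity can only shrink:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 200      # distinct "tokens" in the original, human-written data
sample_size = 500     # finite synthetic training set drawn at each generation

# Generation 0: a long-tailed distribution over the vocabulary, standing in
# for the variety of genuine human-generated content.
probs = rng.dirichlet(np.full(vocab_size, 0.3))

for generation in range(1, 21):
    # Generate synthetic "content" from the current model...
    sample = rng.choice(vocab_size, size=sample_size, p=probs)
    # ...and train the next model purely on that synthetic content.
    counts = np.bincount(sample, minlength=vocab_size)
    probs = counts / counts.sum()
    if generation % 5 == 0:
        surviving = int(np.count_nonzero(probs))
        print(f"generation {generation:2d}: {surviving}/{vocab_size} tokens still represented")
```

Each run shows the number of surviving tokens falling generation after generation, a crude analogue of the loss of diversity reported for models trained on their own outputs.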

What is concerning is that much of the corporate world seems blissfully unaware of the issue. The press is full of articles praising AI based on the latest pronouncements of industry experts, many of whom just happen to be running AI companies, investing in AI companies or consulting on corporate AI implementations. People in these positions have a vested interest in ensuring that the AI bubble continues to inflate, driving up valuations of AI companies and increasing investment in AI. Pundits predict the end of careers in coding, amongst other things, though even in this well-established use case there are issues. There is little doubt that LLMs are useful in programming, at the very least for prototyping, though using them for production code may be more problematic. One prominent voice forecasting the end of programming is Anthropic CEO Dario Amodei, who predicted in March 2025 that by June-September 2025 most code would be written by AI, and that by March 2026 AI would be writing “essentially all code”. However, I observe that the Anthropic website in May 2025 was full of adverts for highly paid software engineering jobs at Anthropic itself: a case of “AI coding for thee, but not for me”.

Investors are beginning to notice that all is not quite as was promised, as this Goldman Sachs article from June 2024 and this March 2025 article in Forbes magazine observe. Some observers think that generative AI may be a bubble. However, as Al Gore discovered when he tried to draw attention to climate change, inconvenient truths are hard to get across when there is a world of vested interests that don’t want to hear your message. A 2025 IBM survey of 2,000 CEOs found that just a quarter of AI projects deliver their expected return on investment. Despite this, most of those CEOs are still investing in AI, and only a few companies, such as the Swedish fintech Klarna, have publicly done a U-turn on generative AI.

Hallucinations in generative AI occur when a model extrapolates from or creatively recombines its training data, blending learned elements into outputs that are nonsensical or inaccurate; contributing factors include overfitting, model complexity and potentially other things. Hallucinations are simply not going away, though one important caveat is that they are primarily a problem of generative AI models (LLMs), and not all types of AI are affected in the same way. Because of the huge media interest in generative AI, it is easy to forget that there are other flavours of artificial intelligence, many of which have been happily running for years without a flicker of hallucination. Rules-based systems do not hallucinate at all, and the classical machine learning systems long used for merge/matching in data quality software, amongst other applications, do not have the same level of issue. Deep learning systems, such as DeepMind’s protein-folding model AlphaFold, can have hallucination issues of their own, by misinterpreting inputs or overfitting to noise in the data, and a computer vision model may “see” objects in an image that are not there. The separate issue of model drift affects pretty much all AI models: changes in user behaviour, market trends or seasonality alter the patterns in the data presented to the model, so its performance needs to be carefully monitored over time.
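As a loose analogy for that “confident answer regardless of the question” failure mode (an analogy only, not a description of how an LLM works internally), consider a trivial nearest-neighbour model: it memorises its training examples perfectly, yet when queried far outside anything it has seen it still returns an answer with no indication of uncertainty:

```python
import numpy as np

# Tiny memorised "training set": a feature value and its label.
train_x = np.array([1.0, 2.0, 3.0, 4.0])
train_y = np.array(["cat", "cat", "dog", "dog"])

def predict(x: float) -> str:
    """Return the label of the closest memorised training example."""
    nearest = int(np.argmin(np.abs(train_x - x)))
    return str(train_y[nearest])

print(predict(1.1))    # "cat" - a sensible answer close to the training data
print(predict(250.0))  # "dog" - far from anything seen, yet still answered confidently
```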

Study after study has shown that AI model performance deteriorates over time and needs to be carefully scrutinised. The torrent of money poured into AI development is, at least so far, not translating into significantly better-performing models: indeed, the evidence shows LLM reliability declining and hallucination rates rising rather than improving. Given this backdrop, corporations jumping on the AI bandwagon would be wise to put in place AI model monitoring and AI governance structures to ensure that their investments are actually delivering. Without this, there is a danger that, in the words of the Dobie Gray song, the benefits may drift away.