Ever since the public release of ChatGPT in November 2022, the world has been fascinated by artificial intelligence (AI), and in particular by the new generative AI driven by large language models (LLMs). A third of all venture capital investment in 2024 was AI-related, and a September 2024 study by Amazon AWS estimated that 57% of all web content was by then AI-generated, a figure that includes articles translated (imperfectly) into other languages by AI. This is itself an issue, since the large language models that underpin generative AI depend on large volumes of high-quality training data, and do less well when trained on AI-generated data.

Soon after ChatGPT gained its huge audience in 2023, users noticed that its generated content was not always reliable. The term “hallucination” came into widespread use to describe AI-generated answers that contained plausible but incorrect information. A lawyer who used ChatGPT to write a court submission was sanctioned after it turned out that the legal precedents cited by the AI were entirely made up. Air Canada lost a court case after its chatbot gave misleading information about its policies to a passenger, and McDonald’s scrapped its rollout of AI-driven order-taking at its drive-through restaurants after videos of customers noting surreal additions to their food orders went viral.

These are not isolated incidents. Nor should they be surprising once you examine the way that generative AI works beneath the surface. LLMs are probabilistic creatures, generating plausible content based on their training data. The more training data they have on a subject, the more accurate their replies to prompts tend to be; where the data is thin, the answers can be disconcerting, as the examples above show.
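To see what “probabilistic” means in practice, here is a deliberately toy sketch of next-word generation as weighted random sampling. The candidate words and scores are invented for illustration; a real LLM derives such scores, for every token in its vocabulary, from billions of learned parameters.

```python
import random

# Toy illustration only: picking the next word as a weighted random draw.
# The candidate words and scores are invented; a real LLM computes such
# scores from its training data for every token in its vocabulary.
candidates = {"Paris": 0.80, "Lyon": 0.12, "Marseille": 0.05, "Berlin": 0.03}

def next_word(weights: dict) -> str:
    words = list(weights.keys())
    return random.choices(words, weights=list(weights.values()), k=1)[0]

# Run it a few times: usually "Paris", but occasionally something else,
# which is why the same prompt does not always yield the same answer.
print([next_word(candidates) for _ in range(5)])
```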

Generative AI is not deterministic in nature, so if you expect the same answer to the same prompt time after time, you will be disappointed. For some applications this does not matter, and it may even be a good thing. If you want to generate a list of creative names for your new kitten, or a new logo for your start-up, then that variability is welcome, and AI models can be tweaked to be even more creative by raising their “temperature” setting, which controls the randomness of their replies.
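As a concrete illustration, here is a minimal sketch of the temperature setting as exposed by a typical LLM API. It uses the OpenAI Python client; the model name is a placeholder, and other vendors offer an equivalent parameter under the same or a similar name.

```python
# Minimal sketch: the same prompt at a low and a high temperature setting.
# Assumes the OpenAI Python client with an API key in the environment;
# the model name is a placeholder.
from openai import OpenAI

client = OpenAI()
prompt = "Suggest five creative names for a new kitten."

for temperature in (0.2, 1.2):  # low = more repeatable, high = more random
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    print(f"--- temperature={temperature} ---")
    print(response.choices[0].message.content)
```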

For certain types of applications, hallucinations may be annoying but not too serious. For example, in generating programming code, if the AI refers to a made-up function or library then that error will be picked up by the compiler or interpreter. However, in many other cases, consistency is vital. Applications that spot fraudulent banking transactions, or find airline passengers on terrorist watch lists, for example, need to be accurate and consistent.

Different studies have shown that around 1 in 5 generative AI answers contain hallucinations. Some studies find a lower figure, some higher, but it is never, ever zero. This is a feature, not a bug, of large language models. Vendors claiming that hallucinations can be eliminated by better training data or by techniques such as retrieval augmented generation (RAG) are mistaken, though the accuracy of answers may certainly improve to a degree, since supplying external datasets gives the models a better semantic grasp of the problem domain. This is not a transient situation that will be quickly resolved: the consensus amongst AI researchers is that hallucinations are too deeply ingrained in the nature of LLMs to be eliminated. There is no shortage of claims to the contrary on the internet, but these generally come from vendors with a vested interest in pushing that line. To repurpose the immortal line from the movie “The Princess Bride” (1987): “Life is pain; anyone who says differently is selling something”. Unlike, say, Excel, an LLM will not necessarily return the same answer to the same question each time, so it is not a repeatable process. This underlying issue of hallucination and consistency is a particular problem when it comes to the latest trend in AI, as we shall see.

The AI buzzword in 2025 is “agentic AI”. This is the concept that a network of AIs can be set up to achieve goals, making decisions autonomously, without human intervention. For example, you could have an AI assistant that could check your calendar and your preferences and not merely suggest a holiday itinerary but, with the help of your credit card, actually book flights and hotels for you. This concept certainly has many potential applications across a range of industries. Such agents could optimise inventory in response to real-time demand, manage computer networks to minimise outages, run customer service chatbots that can pull in relevant data, run a factory floor while optimising energy usage or even (it has been suggested) act as virtual caregivers to the elderly. Agentic AI is about achieving goals, not just generating content. The potential applications are almost endless, with one proviso: it has to actually work.

LLMs can be thought of as predictive text generators on steroids. Instead of just predicting the end of your sentence as you type a query, they can produce a whole paragraph or essay in response to your prompt. You could, for example, ask an LLM to write a marketing plan for a new sales campaign. Rather than just accepting this, you can take this draft plan and pass it to another LLM and ask it to critique the first plan. A third LLM could then be invoked, to take the draft and the critique together to produce a new, improved plan. Yet another agent could be invoked to do some research to provide background data or statistics to support the campaign; another agent might apply brand guidelines to the plan. To make it more elaborate, there could be an orchestration agent that is instructed to go through this whole process many times, refining the end product through each loop, using the various tools at its disposal. Sound good?
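To make the shape of such a pipeline concrete, here is a minimal sketch of that draft / critique / research / refine loop. The call_llm helper, the prompts and the number of rounds are placeholders, not a reference to any particular agent framework.

```python
# Sketch of a multi-step LLM "agent" chain: draft, critique, research, refine.
# call_llm() is a hypothetical wrapper around whichever LLM API you use.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wrap your preferred LLM API here")

def run_campaign_pipeline(brief: str, rounds: int = 3) -> str:
    plan = call_llm(f"Write a marketing plan for this campaign brief:\n{brief}")
    for _ in range(rounds):  # the orchestration loop
        critique = call_llm(f"Critique this marketing plan:\n{plan}")
        research = call_llm(f"Find supporting statistics for this plan:\n{plan}")
        plan = call_llm(
            "Rewrite the plan, addressing the critique and using the research.\n"
            f"Plan:\n{plan}\nCritique:\n{critique}\nResearch:\n{research}"
        )
    return plan
```

Note that every call to the LLM in that loop is a point at which a hallucination can enter and be passed downstream, which is where the compounding problem described below comes from.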

Now remember the hallucination issue. At present, LLMs typically hallucinate in roughly 1 in 5 of their answers. You can check this yourself by asking an AI directly, e.g.:

“Some studies suggest that LLMs can hallucinate as much as 27% of the time, with factual errors present in 46% of cases”.

Perplexity AI’s answer to “What is the average LLM hallucination rate?”

(By comparison, ChatGPT reckons an average hallucination rate of 5%–30% as of March 2025.)

So let’s generously say that an LLM has an 80% chance (4 in 5) of producing a correct answer. If you feed the answer of one LLM to another, the errors will compound. So after two linked LLM answers, the overall success rate will be:

0.8 × 0.8 = 0.64 (i.e. 64%)

If you then feed this answer to another LLM, the success rate becomes:

0.8 × 0.8 × 0.8 ≈ 0.51 (51%)

and so on. If four LLMs are linked up, feeding answers to each other, the success rate is:

0.8 × 0.8 × 0.8 × 0.8 ≈ 0.41 (41%)

So if just four LLMs are chained together, feeding answers to one another, the success rate of the final answer is 41%, i.e. less than half. In how many business processes is that success rate acceptable? Indeed, in how many business processes is 80% success acceptable? If you had an 80% success rate in delivering orders to your customers, is that something your manager would applaud? If 80% of your products rolled off a production line successfully, would that be OK? Hardly.

In manufacturing, companies aim for error rates below 1%. The Six Sigma process, used by companies such as GE, 3M and Toyota, aims for defect rates below 3.4 per million. Even the most sympathetic studies of LLMs have shown hallucination rates of around 2% at a minimum, with the rate varying by task complexity, training data availability and model size. Even 2% is far too high an error rate for a good-quality manufacturing process. Whatever the exact figure, the point is that feeding the results of one LLM to another, and then to another, compounds the errors. The more LLMs are chained together, the worse the success rate of the final outcome.

“If your world model has just a 1% error rate, if you build over 50 or 100 steps, that 1% compounds. By the time you’ve done those 50 or 100 steps, you’re in potentially a random place.”

Demis Hassabis, CEO of Google DeepMind
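The compounding arithmetic above is easy to reproduce; this short sketch simply raises a per-step success rate to the power of the number of chained steps (the 80% and 2% figures are the illustrative ones used in this article, not measured hallucination rates).

```python
# Compounded success rate for a chain of steps, assuming each step
# independently succeeds with probability p (the simplification used above).
def chain_success(p: float, steps: int) -> float:
    return p ** steps

print(chain_success(0.80, 2))   # 0.64   -> 64%
print(chain_success(0.80, 3))   # 0.512  -> 51%
print(chain_success(0.80, 4))   # 0.4096 -> 41%
print(chain_success(0.98, 34))  # ~0.503 -> a 2% per-step error rate roughly
                                #           halves the success rate by 34 steps
```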

Now ask yourself, given this situation, would you give an agentic AI chain your credit card and let it loose on the internet, perhaps to book a holiday, complete with flights and hotels? Is a 20% error rate OK for that task, in terms of ending up with a flight and a hotel where you want them, on the correct dates? Even at a 2% error rate, how comfortable would you be? At a mere 2% error rate per step, it takes only around 34 compound steps for the overall success rate to fall to roughly 50%. The more chained LLMs there are, the higher the compounding error rate. Now, it should be borne in mind that the various agents do not all have to be LLMs: some could be calls to an application programming interface (API), or a machine learning algorithm, or something else. However, a series of such calls to subroutines linked together is not new and has nothing whatever to do with AI: it is called a computer program. To claim “agentic AI”, there needs to be at least some AI involved.
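Turning the same assumption around, you can ask how many chained steps it takes for the compounded error to pass 50% at a given per-step accuracy; again, the figures are illustrative, not measured.

```python
import math

# Smallest number of chained steps at which the compounded error exceeds 50%,
# assuming each step independently succeeds with probability p.
def steps_until_majority_error(p: float) -> int:
    return math.ceil(math.log(0.5) / math.log(p))

for p in (0.80, 0.95, 0.98, 0.99):
    print(f"{p:.0%} per-step accuracy -> {steps_until_majority_error(p)} steps")
# 80% -> 4, 95% -> 14, 98% -> 35, 99% -> 69
```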

This compounding error rate is a huge issue for agentic AI to overcome. All you can really do is try to make the models more accurate, or put a human in the loop. LLM accuracy may be sharpened by confining the models to very specific tasks with plenty of relevant training data, but as we have seen, there will always be errors, and even a 2% error rate quickly compounds into something much worse if enough LLM steps are involved. Introducing a human into the loop to check things is fine, but surely this defeats the object of agentic AI, which is to have “agency” and be self-sufficient.
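The “human in the loop” option, in its simplest form, is an approval gate in front of any irreversible action. The sketch below is a generic illustration of that pattern, with hypothetical placeholder functions rather than any specific product’s API.

```python
# Sketch of a human-in-the-loop gate: the agent chain proposes, a person approves.
# propose_booking() and execute_booking() are hypothetical placeholders.

def propose_booking(request: str) -> dict:
    raise NotImplementedError("the agent chain that drafts a booking proposal")

def execute_booking(proposal: dict) -> None:
    raise NotImplementedError("the irreversible step: payment, tickets, etc.")

def book_with_approval(request: str) -> None:
    proposal = propose_booking(request)
    print("Proposed booking:", proposal)
    if input("Approve this booking? [y/N] ").strip().lower() == "y":
        execute_booking(proposal)
    else:
        print("Booking cancelled; nothing was charged.")
```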

A further issue is security. For the agents to work, they will need access to resources. For an agentic AI travel agent or assistant to book a holiday for you and a friend, it would need access to your calendar, your email or messaging service and your credit card details. The LLM itself will typically be running on a cloud server, so that sensitive data has to travel back and forth over the network. As pointed out by Meredith Whittaker, president of messaging vendor Signal, there is currently no encryption model for this, so that sensitive data is exposed.

Another problem with fully automating existing processes is whether the agentic AI chain is flexible enough to react correctly to unexpected events or inputs, in the way that a human worker can. If you are dealing with customer service, for example, those human customers are unpredictable. Are you confident that an agentic AI application could successfully navigate all, or at least most, unexpected inputs or events? Would it do so consistently?

Further issues abound, from data quality, to the compute costs that agentic AI demands, to ethical misalignment, to the practical difficulty of embedding agentic AI within businesses given organisational boundaries and employee resistance. It has also been noted by regular, heavy users of ChatGPT that the quality of responses can deteriorate over time, due to a phenomenon called “LLM drift”. LLMs may change their responses in reaction to new information (LLMs are usually trained on a snapshot of data, and the world moves on) or to model updates. There are also “internal model dynamics”, where LLM behaviour appears to change over time even without external intervention, for reasons that remain unclear at this point. LLMs given the same question repeatedly (such as whether a certain number is prime) have been observed to become steadily less accurate over time. This means that LLM outputs need to be carefully monitored, especially for applications that actually rely on these models. This variability could be a particular issue in regulated industries, where compliance has to be demonstrated.
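One practical way to monitor for drift is a standing regression test: re-run a fixed set of questions with known answers on a schedule and track accuracy over time. Below is a minimal sketch, again using a hypothetical call_llm wrapper and invented benchmark questions.

```python
import datetime

# Sketch of drift monitoring: re-ask fixed questions with known answers on a
# schedule and log the accuracy, so any change over time becomes visible.
# call_llm() is a hypothetical wrapper around the model in production.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wrap your production LLM here")

BENCHMARK = [
    ("Is 7919 a prime number? Answer yes or no.", "yes"),
    ("What is 17 * 24? Answer with the number only.", "408"),
]

def run_drift_check() -> float:
    correct = sum(
        1 for question, expected in BENCHMARK
        if expected.lower() in call_llm(question).lower()
    )
    accuracy = correct / len(BENCHMARK)
    print(datetime.date.today(), f"benchmark accuracy: {accuracy:.0%}")
    return accuracy
```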

Getting AI to meet its intended goals has a mixed history. In 2018 Amazon had to scrap an experimental AI hiring system that was found to discriminate against women, due to a preponderance of male résumés in its training data. Self-driving cars have had difficulty reacting to unexpected events, sometimes with fatal consequences. Dealing with edge cases, and with AI systems having unintended consequences, is a significant challenge for agentic AI. One example was when a Chevrolet dealership (briefly) deployed a customer service chatbot, and a customer manipulated it into agreeing to sell him a 2024 Chevy Tahoe for $1, a handy $58,194 discount on the list price.

Although there are plenty of demonstrations of agentic AI from vendors, it is difficult to find real-life case studies. For example, when I looked into this, I found one promising-looking conference talk about a case study, but it turned out to be just a training chatbot project using a single LLM, despite the talk title implying it was agentic AI. I kept looking. I found case studies in manufacturing where AI was used for predictive maintenance, analysing sensor data to predict equipment failure, but again these were not true agentic AI, just pattern recognition, usually done with machine learning rather than an LLM. The same issue occurred when I explored several healthcare examples where the article title claimed agentic AI but the actual text described something quite different.

Dig into an article labelled an agentic AI case study in 2025 and you will likely find something less exotic involved; real case studies seem, at the time of writing, to be as rare as unicorn horns. Indeed, when I asked some AIs themselves about this, they happily trotted out a list of projects that, on inspection of the sources, proved not to be agentic AI. When I challenged them to find use cases that were explicitly agentic AI, and not just AI use cases in general (which they all happily listed initially), I received these responses:

“Finding true agentic AI case studies can be challenging because many applications blend machine learning with automation but do not fully meet the criteria for agentic AI.”

(Perplexity, March 2025)

“After reviewing the case studies I mentioned, I should clarify that none of them have publicly available documentation that specifically describes their implementations as “agentic AI” with multiple specialized AI agents working in chains as I characterized them.”

(Claude, March 2025)

“Identifying real-world implementations of agentic AI—where multiple autonomous agents collaborate in a structured sequence—can be challenging due to the novelty and complexity of such systems. As the field of agentic AI evolves, more detailed case studies may become available.”

(ChatGPT, March 2025)

“Documented, publicly available case studies of multi-agent AI systems in production are still relatively rare”.

(Gemini, March 2025)

“True agentic AI—fully autonomous systems capable of independently setting goals, reasoning through complex problems, adapting to new situations, and executing tasks with minimal human intervention—remains more of an aspirational concept than a widely achieved reality as of March 30, 2025.”

(Grok, March 2025)

Doubtless, there are some true examples out there, and probably there are proprietary cases that are not documented publicly. Nonetheless, in many hours of trying, using both Google and five AI tools, I was unable to uncover a single case study that was convincingly and undeniably agentic AI in production use. At the least, this paucity suggests that the area is not exactly brimming over with real-life production use cases. This may reflect the novelty of the field, but it may also reflect the fundamental issue of reliability highlighted above.

Even the five AIs that I tried were unable to find (or even hallucinate) a single documented use case that stood up to scrutiny. With generative AI’s inherent lack of consistency and the hallucination rates that LLMs currently exhibit, it is hard to imagine many use cases where true agentic AI would be deployed in production for mission-critical processes. Agentic AI has future promise for sure, but at the moment it is just a dream, or perhaps I should say a hallucination.