LLMs are non-deterministic, which makes it difficult to predict how they will behave in the real world (aka production). You can give the same prompt with nothing more than an extra space and get a completely different output. This is a byproduct of using a non-deterministic model.
Hallucinations are when the model's output is plausible-sounding but incorrect. You will undoubtedly run into them, sometimes without even realizing it.
The running joke in the industry is that hallucinations are a “feature, not a bug”. The reasoning is that the model’s capacity for creativity comes from the same mechanism that produces hallucinations. The primary mitigation technique is grounding the model in real-world data (more in RAG).
Battling hallucinations is an open problem. The underlying difficulty is that LLMs take text in and produce text out: one question can be asked in a thousand different ways, and testing every possible scenario is impossible. We use in-house techniques to build a thorough testing suite before releasing LLMs to production. Another approach is adding guardrails on a prompt’s input or output (more information here). As you can imagine, this takes just as much time as building the model.
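To make the guardrail idea concrete, here is a minimal sketch of an output guardrail that checks a model’s reply against a banned-phrase list and a length cap before it reaches a user. The phrases, threshold, and function name are illustrative assumptions, not part of any particular guardrail library.

```python
# Minimal sketch of an output guardrail (illustrative only).
# The banned phrases and length cap below are made-up examples,
# not defaults from any real guardrail framework.

BANNED_PHRASES = ["as an ai language model", "i guarantee"]
MAX_CHARS = 2000

def passes_output_guardrail(reply: str) -> bool:
    """Return True only if the reply clears every check."""
    lowered = reply.lower()
    if any(phrase in lowered for phrase in BANNED_PHRASES):
        return False  # reject replies containing banned phrasing
    if len(reply) > MAX_CHARS:
        return False  # reject suspiciously long replies
    return True

# Usage: only surface the reply if the guardrail passes; otherwise
# fall back to a safe default or route it to human review.
reply = "Yes, the refund was processed on March 3."
print(reply if passes_output_guardrail(reply) else "[escalated to human review]")
```

Real guardrail setups layer many more checks (input filtering, schema validation, secondary model critiques), but the pattern is the same: deterministic code sits between the model and the user.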
The lack of control over hallucinations means LLMs should not be used for mission-critical applications. Even if GPT-4 gets the right answer 9,999 times out of 10,000, the one time it is wrong can have devastating consequences.
We typically recommend having a human review the response before making an AI-powered decision. An alternative is to constrain the output (for example, to Yes/No) and have a computer program evaluate whether the LLM's output is valid, as sketched below.
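Here is a rough illustration of that constrained-output approach: ask for a strict Yes/No answer, then let ordinary code decide whether the reply is usable. The `call_llm` function is a hypothetical placeholder for whatever model client you use, and the prompt wording is an assumption.

```python
# Sketch of constraining an LLM to Yes/No and validating the result.
# `call_llm` is a hypothetical stand-in for your real model client.

VALID_ANSWERS = {"yes", "no"}

def call_llm(prompt: str) -> str:
    # Placeholder: in practice this would call your LLM provider's API.
    return "Yes"

def ask_yes_no(question: str) -> str | None:
    """Return 'yes' or 'no' if the model complies, otherwise None."""
    prompt = f"{question}\nAnswer with exactly one word: Yes or No."
    answer = call_llm(prompt).strip().lower().rstrip(".")
    # A plain program can now verify the output instead of trusting free text.
    return answer if answer in VALID_ANSWERS else None

result = ask_yes_no("Does this support ticket mention a billing issue?")
if result is None:
    print("Model did not follow the format; route to a human.")
else:
    print(f"Validated answer: {result}")
```

The point is not the specific format; it is that a constrained output gives deterministic code something it can actually check, so a malformed or off-script answer gets caught instead of silently acted on.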
It’s important to note that this recommendation is based on the current state of the technology. In the next 5-10 years, things may change as the technology continues to advance rapidly.
How often do we see LLMs in public, beyond the traditional chatbot? Not often. Hallucinations make it impossible to trust ChatGPT to represent your business and reputation.
The bulk of AI applications seem to be focused internally. Supercharging employee productivity is the most popular AI use-case for B2B. Examples include creating slide decks, drafting reports, or making sense of qualitative customer data. The value is real: an LLM can do these tasks in a few seconds for a few cents, instead of the few hours and tens of dollars they would cost an employee.
Using an LLM internally means the human steering the model can catch and correct any potential hallucinations.
“How do we avoid LLM hallucinations?” is quite literally the billion-dollar question. Though this was a short guide, hallucination as a point of failure is the primary concern when determining whether LLMs are a fit for your use-case.