Tools to manage AI hallucinations

TL;DR: AI will hallucinate no matter how good your prompt is. Here's a mindset shift and four practical tools to catch and fix hallucinations before they reach production.

AI hallucinations are a never-ending source of memes on the internet. If you go on Reddit or have developer friends, you've probably seen more than your fair share of them.

This isn't a new topic. It's the reason we have AGENTS.md files and why we give AI tools super-specific instructions, lots of context, and so on. In my experience, these techniques help, but they don't fix some specific problems I have. This post is a collection of useful tools and, most importantly, a mindset shift in how to work with AI to produce output that meets the standards of a production codebase.

I use AI every day to write hundreds of lines of code, and that number keeps growing. I keep noticing avoidable mistakes, which is why I'm constantly looking for ways to limit the chaos and use AI more effectively.

Mindset shift

What really changed my approach to working with AI were the following realizations:

AI will hallucinate even when your prompt is perfect and specifically says something like "don't hallucinate" or "don't make anything up". It's not a matter of if, but when.
Currently, we're using AI to write the plan, the code, and the tests that check whether the code is correct. The checks are performed by the same AI, so they can be wrong too. It's like asking a student to grade their own exam.
LLMs produce legit-looking code, and we can't tell in advance when that code is correct and when it needs intervention. The output isn't deterministic, so we need hard checkpoints along the way to catch and fix mistakes.

What does this mean for our mindset?

Always look for ways to verify the AI's hypothesis with numbers, logs and facts.
Find tools that can automatically and 100% deterministically verify that code is correct and works.
Look back at tools you didn't use in the past because they were hard or complicated. They might be a solution to a current problem.

Here are four tools I use to help manage the chaos. My area of expertise is frontend development, and while some of the tools won't be 100% applicable as they are described here, you might find inspiration to use another tool that will work for your domain.

1. Writing really good tests

You've probably prompted AI to write tests for you and gotten mixed results. Sometimes the test is genuinely good and covers the scenario perfectly, but other times it mocks away everything and forgets to insert an assertion. Let's take a look at how we can fix that.

Scaffolding tests

At the companies where I've worked, coverage was always treated as more of a vanity metric, and I never really looked at the numbers. It turns out coverage is perfect for writing a first draft of tests.

Coverage tracks which lines, branches, or functions were executed by the tests. It tells you what code ran, not whether it ran correctly, and it won't catch every edge case, but it's a great place to start. You prompt AI with something like "get the coverage for a given file and write tests that cover all the lines that were not covered".
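
One way to give the AI a precise target is to turn the coverage report into a list of uncovered lines first. Here is a minimal sketch, assuming an Istanbul-style per-file coverage object (the statementMap and s fields found in coverage-final.json). The uncoveredLines helper and the sample data are hypothetical.

```typescript
// One file's entry in an Istanbul-style coverage report
// (only the fields this sketch reads).
interface FileCoverage {
  path: string;
  // statement id -> source location of that statement
  statementMap: Record<string, { start: { line: number }; end: { line: number } }>;
  // statement id -> number of times the statement was executed
  s: Record<string, number>;
}

// Returns the sorted, de-duplicated start lines of statements that never ran.
function uncoveredLines(file: FileCoverage): number[] {
  const lines = new Set<number>();
  for (const id of Object.keys(file.s)) {
    if (file.s[id] === 0) lines.add(file.statementMap[id].start.line);
  }
  return Array.from(lines).sort((a, b) => a - b);
}

// Made-up sample data for illustration.
const sample: FileCoverage = {
  path: "src/price.ts",
  statementMap: {
    "0": { start: { line: 2 }, end: { line: 2 } },
    "1": { start: { line: 4 }, end: { line: 4 } },
    "2": { start: { line: 7 }, end: { line: 7 } },
  },
  s: { "0": 3, "1": 0, "2": 0 },
};

console.log(`${sample.path}: uncovered lines ${uncoveredLines(sample).join(", ")}`);
```

Pasting a short list like this into the prompt leaves less room for the AI to guess which code still needs tests.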

Fixing missed cases

The problem with coverage is that it can be 100%, but the tests can still:

Make wrong assertions
Miss edge cases
Enter a branch without exercising every condition inside it

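The way to fix that deterministically is to use a mutation testing tool like Stryker. What does Stryker do? It makes small changes to the code (mutants), such as flipping a comparison operator, and runs the tests against each one. If a test fails, the mutant is "killed", which is what you want. If all the tests still pass, the mutant "survived": the tests don't notice that change in behavior, and we need to write a test or a stronger assertion for it.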
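
To see why a surviving mutant matters, here is a hand-rolled illustration of the idea. It is not how Stryker is implemented or configured. The isAdult function, the mutant, and the tiny runner are all hypothetical.

```typescript
type Impl = (age: number) => boolean;
// A test case returns true when it passes against the given implementation.
type TestCase = (impl: Impl) => boolean;

// Original code under test.
const isAdult: Impl = (age) => age >= 18;
// A typical mutant: the boundary operator `>=` is changed to `>`.
const mutant: Impl = (age) => age > 18;

// A suite that executes every line of isAdult (100% coverage)...
const weakSuite: TestCase[] = [
  (impl) => impl(30) === true,
  (impl) => impl(10) === false,
];
// ...and the same suite plus a boundary test.
const strongSuite: TestCase[] = weakSuite.concat([(impl) => impl(18) === true]);

// A mutant is "killed" when at least one test fails against it.
function kills(suite: TestCase[], mutated: Impl): boolean {
  return suite.some((test) => !test(mutated));
}

console.log("weak suite passes on original:", weakSuite.every((t) => t(isAdult)));
console.log("weak suite kills mutant:", kills(weakSuite, mutant)); // false: the mutant survived
console.log("strong suite kills mutant:", kills(strongSuite, mutant)); // true
```

The weak suite has full coverage and still lets the mutant through. Only the boundary test at 18 catches it, and that is exactly the kind of gap a mutation report points at.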

Stryker takes a while to run, so you can't run it every time you write a test for a new file. I recommend using it in a targeted fashion, for example only on the files you just changed.

Make AI clean up the code

Now that you have tests that confidently cover most of the code, you can ask AI to clean up the code according to your preferences. This is mostly an improvement to make code adhere to your project's standards, conventions, and personal taste. For instance, I ask AI to:

Never use getByText, because copy might change; query by data-testid instead.
Never assert with toHaveBeenCalled alone; use toHaveBeenCalledWith instead, because then the test verifies that we pass the correct arguments.

This is fine to delegate to the AI because the tests are already covering the code, and we're fixing "subjective" preferences.
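
The difference between the two spy assertions is easy to show with a hand-rolled spy. This is a sketch, not Jest or Vitest code. The createSpy helper, saveUser, and the deliberate bug are hypothetical.

```typescript
// Minimal stand-in for a test-framework spy: records the arguments of each call.
function createSpy() {
  const calls: unknown[][] = [];
  const spy = (...args: unknown[]): void => {
    calls.push(args);
  };
  return { spy, calls };
}

// Code under test with a deliberate bug: it passes the name where the id should go.
function saveUser(
  user: { id: number; name: string },
  save: (id: unknown, name: unknown) => void,
): void {
  save(user.name, user.name); // bug: the first argument should be user.id
}

const { spy: saveSpy, calls: saveCalls } = createSpy();
saveUser({ id: 7, name: "Ada" }, saveSpy);

// Rough equivalent of toHaveBeenCalled(): only checks that a call happened.
const wasCalled = saveCalls.length > 0;
// Rough equivalent of toHaveBeenCalledWith(7, "Ada"): also checks the arguments.
const wasCalledWithExpected = saveCalls.some(
  (args) => args.length === 2 && args[0] === 7 && args[1] === "Ada",
);

// The weak check passes despite the bug; the strict check fails and exposes it.
console.log({ wasCalled, wasCalledWithExpected });
```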

2. Reducing hallucinations by fixing naming conventions and enforcing them

I use the Figma, Jira, and Notion MCPs almost every day. They're awesome for getting the requirements, making a plan, and executing it. When code is written by a human, they implicitly know what things mean. They understand that what Jira calls AIPanel is called LiteratureAIChat in the code, or that what Figma calls background-default is called color-background-default in the code. It's small enough for a human to remember — and small enough for AI to make up.

That's why it's crucial to name things consistently. It's also really easy to fix: you just ask AI to do it for you. Recently I renamed all the CSS variables in the codebase to exactly match Figma, and color hallucinations have dropped significantly. It's not 100% perfect, though, because new colors can be added in Figma but not in the codebase yet.

Custom linters

To deal with made-up CSS tokens, I implemented a custom linter. We use Biome, and it's really easy to use AI to write one. This way you have an instant, 100% accurate way to make sure AI didn't make up a token. Additionally, we can catch early on if there are new tokens in the design system that don't yet exist in the code.
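
The core of such a check is small. The sketch below is a standalone script, not a Biome rule and not Biome's plugin API. It assumes tokens are referenced as var(--token-name). The token names and the findUnknownTokens and findMissingTokens helpers are hypothetical.

```typescript
// Tokens defined in the codebase and tokens exported from the design tool (sample data).
const codeTokens = new Set(["color-background-default", "color-text-primary"]);
const designTokens = new Set([
  "color-background-default",
  "color-text-primary",
  "color-border-subtle",
]);

// Finds every var(--token) reference in a source string that is not a known token.
function findUnknownTokens(source: string, known: Set<string>): string[] {
  const unknown: string[] = [];
  const pattern = /var\(\s*--([A-Za-z0-9_-]+)/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    if (!known.has(match[1])) unknown.push(match[1]);
  }
  return unknown;
}

// Finds design tokens that have not been added to the codebase yet.
function findMissingTokens(design: Set<string>, code: Set<string>): string[] {
  return Array.from(design).filter((token) => !code.has(token));
}

const css = [
  ".panel { background: var(--color-background-default); }",
  ".title { color: var(--color-text-emphasis); }",
].join("\n");

console.log("made-up tokens:", findUnknownTokens(css, codeTokens));
console.log("not yet in code:", findMissingTokens(designTokens, codeTokens));
```

The first check catches tokens the AI invented. The second catches tokens that exist in the design system but haven't reached the code yet.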

When it comes to custom linting, your imagination is the limit. I also wrote another linter that makes sure we use reusable components instead of raw HTML. For example, <p>Hello World</p> is not allowed, but <Typography>Hello World</Typography> is.
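
The component rule can be sketched the same way. This is a regex-based illustration, again not Biome's API, and a real rule would inspect the syntax tree rather than raw text. The tag-to-component mapping and the findRawTags helper are hypothetical.

```typescript
// Raw tags that should be replaced, and the component to use instead (sample mapping).
const replacements = new Map<string, string>([
  ["p", "Typography"],
  ["h1", "Typography"],
  ["button", "Button"],
]);

interface RawTagFinding {
  line: number;
  tag: string;
  suggestion: string;
}

// Reports each opening raw tag that has a reusable-component replacement.
function findRawTags(source: string): RawTagFinding[] {
  const findings: RawTagFinding[] = [];
  source.split("\n").forEach((text, index) => {
    // Lowercase tag names only, so capitalized components are skipped.
    const pattern = /<([a-z][a-z0-9]*)\b/g;
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(text)) !== null) {
      const suggestion = replacements.get(match[1]);
      if (suggestion !== undefined) {
        findings.push({ line: index + 1, tag: match[1], suggestion });
      }
    }
  });
  return findings;
}

const jsx = ["<Typography>Hello World</Typography>", "<p>Hello World</p>"].join("\n");

// Only the raw <p> on the second line is reported.
console.log(findRawTags(jsx));
```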

3. Making sure semantic HTML is used

Another problem I saw was divs with onClick handlers used as buttons, or headings used where a paragraph should've been. This isn't a new problem; semantic HTML is hard and always has been. An even bigger topic is accessibility. It's a big knowledge area, and not many people focus on it, so it's very easy to make mistakes.

The issue is that if you ask AI to "fix accessibility issues", it might find actual problems, or it might make up an issue or make up the fix. And if, like me, you don't focus on accessibility, you might not be able to tell whether AI solved the issue or not.

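So how could we improve that? The answer is to look for tools that check this deterministically. For instance, we use Axe to check for accessibility violations. It shows you immediately in the browser console if there are any violations. Automated checks only cover a subset of accessibility issues, but the violations they report are concrete and reproducible.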

Moreover, if you're using a cloud agent, you can ask it to run the app and use the Axe CLI tool to get the violations and fix them. This way you have a reliable source that tells the agent what it needs to fix. We move from AI guessing what to do to AI fixing real problems.
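
However the agent obtains the report, it helps to reduce it to a short, unambiguous task list. The sketch below assumes the result shape that axe-core produces (a violations array whose entries have id, impact, help, and nodes with target selectors), simplified so that target is a plain list of strings. The toTaskList helper and the sample report are hypothetical.

```typescript
// The subset of an axe-core result that this sketch reads.
interface AxeNode {
  target: string[]; // selector path to the offending element (simplified)
  html: string;
}
interface AxeViolation {
  id: string;
  impact: string | null;
  help: string;
  nodes: AxeNode[];
}

const impactOrder = ["critical", "serious", "moderate", "minor"];

// Turns violations into one line per offending element, most severe first.
function toTaskList(violations: AxeViolation[]): string[] {
  const rank = (v: AxeViolation): number => {
    const index = v.impact === null ? -1 : impactOrder.indexOf(v.impact);
    return index === -1 ? impactOrder.length : index;
  };
  const tasks: string[] = [];
  for (const v of violations.slice().sort((a, b) => rank(a) - rank(b))) {
    for (const node of v.nodes) {
      const impact = v.impact === null ? "unknown" : v.impact;
      tasks.push(`[${impact}] ${v.id}: ${v.help} at ${node.target.join(" ")}`);
    }
  }
  return tasks;
}

// Made-up sample report for illustration.
const axeReport: AxeViolation[] = [
  {
    id: "color-contrast",
    impact: "serious",
    help: "Elements must meet minimum color contrast ratio thresholds",
    nodes: [{ target: [".subtitle"], html: '<span class="subtitle">Hi</span>' }],
  },
  {
    id: "button-name",
    impact: "critical",
    help: "Buttons must have discernible text",
    nodes: [{ target: ["#close"], html: '<button id="close"></button>' }],
  },
];

console.log(toTaskList(axeReport).join("\n"));
```

Each line names a rule and a selector, so the agent's fix can be verified by rerunning the same check.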

4. Sifting through big datasets to find patterns

Recently, I had to fix a React performance issue I had no idea how to fix. I recorded a session in the React DevTools Profiler, exported the JSON, and dumped it into Cursor with "here's some data, figure out a pattern". It found that one component was re-rendering around 3000 times during a single streaming response because of a prop reference that changed on every token. (Full write-up here.) Nobody can manually read through a Profiler JSON dump, but AI doesn't mind.
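
A small aggregation is a useful way to sanity-check the pattern the AI reports. The sketch below does not parse the real React DevTools export format. It assumes the data has already been flattened into hypothetical { component, commit } records, and the component names are made up.

```typescript
// Hypothetical flattened record: one entry per component render in a commit.
interface RenderRecord {
  component: string;
  commit: number;
}

// Counts renders per component and returns them sorted, most frequent first.
function renderCounts(records: RenderRecord[]): Array<[string, number]> {
  const counts = new Map<string, number>();
  for (const r of records) {
    counts.set(r.component, (counts.get(r.component) || 0) + 1);
  }
  return Array.from(counts.entries()).sort((a, b) => b[1] - a[1]);
}

// Made-up data: MessageList renders on every commit, Sidebar only once.
const renderRecords: RenderRecord[] = [];
for (let commit = 0; commit < 5; commit++) {
  renderRecords.push({ component: "MessageList", commit });
}
renderRecords.push({ component: "Sidebar", commit: 0 });

console.log(renderCounts(renderRecords));
```

A component whose render count tracks the number of commits, as MessageList does here, is the first place to look for a prop whose reference changes on every update.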

The more important question is: what tools are out there that we didn't use before because they were hard to use or complicated to understand, but that AI won't have any trouble with?

Maybe this is the comeback of snapshot testing? I never really used snapshot tests because manually checking diffs in JSON is quite painful, but for AI it shouldn't matter.

Why this matters

We're writing more code than ever before. We're also writing code in areas in which we might not be experts. It might be that you're a web developer but have no experience with web performance. Or you could be a web developer who is now starting to contribute to backend repositories. That's where tools earn their keep — they let you ship in areas you don't fully understand.

Moreover, the more code we write, the less code we actually read. When a colleague sends me five PRs a day to review, each with hundreds of changed lines, I might not catch the issues that the AI didn't flag as issues. So using tools that verify with high certainty that code is correct is crucial.

The broken windows theory says that visible signs of disorder and neglect (e.g., vandalism, litter, public drinking) create an environment that encourages further crime. The same goes for codebases: if a codebase is already messy, it's very likely to stay messy. That's what the linters and tests are for: fixing the first broken window before the rest follow. We don't yet know how AI will affect our codebases. Will it magically fix the issues, or will it replicate patterns that already exist?

I'm a strong proponent of making it easier for my colleagues to contribute to the web repo. And I also care about doing things properly. That's why I'm constantly on the lookout for tools that will keep the codebase clean and make it easy for other people to extend it.