How to Help AI Write Better Tests

TL;DR: AI-generated tests verify that code runs, not that it's correct. Custom commands, a quality gate, and coverage targets turn AI into a genuinely powerful testing partner.

AI can write 30 tests in two minutes. Getting it to write 5 that actually matter takes a bit more work. The problem: AI-generated tests verify that code runs, not that it's correct. Here's how to start fixing that.

Over the last few months, Cursor has been my main tool for writing unit tests. The stack is React, Remix, Vitest, and React Testing Library. After going back through 58 chat sessions — 18 of them involving testing — here's what worked, what didn't, and what's worth trying.


The workflow

For this experiment, tests always came after the feature was done. Build the feature in one session, open a new session, write the tests. That's not a stance against TDD — it's the easiest place to start when exploring AI-assisted testing.

That separation turned out to matter. When AI builds a feature and writes tests at the same time, it optimizes for making the tests pass, not for covering real behavior. Splitting them keeps each session focused.


Custom commands

Cursor lets you define custom commands — reusable prompts that you can trigger with a slash command. Two of them turned inconsistent test coverage into a repeatable process.

/init-test — writing tests

This command gives AI a structured workflow for test creation. Here's a simplified version:

### Before writing any test, read the target file and extract:

1. All `data-testid` values — list them. Never guess test IDs.
2. Shared constants used in rendering (e.g. `EMPTY_TEXT`, enum values) — use these in assertions, don't hardcode strings.
3. Feature flag versions — what version enables each feature? Test both sides.

### Mock strategy

Start with zero mocks. Only add mocks when strictly necessary. For each mock, justify it. If you can't explain why the test fails without it, remove it.

### Workflow

1. Write tests one case at a time, not all at once.
2. Run coverage: `npm test -- path/to/spec.tsx --coverage.enabled --coverage.include=path/to/component.tsx`
3. Fill gaps for any uncovered branches.
4. Aim for >95% branch coverage. Document anything intentionally untested.

### Assertion rules

- Every assertion must verify the correct value, not just existence.
- Never use optional chaining in assertions. Tests should fail loudly.
- Use `toHaveBeenCalledWith(value)` not `toHaveBeenCalled()` alone.

These are simplified extracts. The real commands are longer and tailored to the codebase — referencing specific test utilities, project conventions, and known pitfalls. The key ideas: read the code before writing tests, start with zero mocks, use coverage as the driver, and write one test at a time.

/init-post-test — reviewing tests

This command runs after tests are written. It's a quality gate:

### Assertion strength audit

- Scan every `expect()`. Flag any that only check existence without verifying the correct value.
- `expect(x).toBeDefined()` without a value check → add `expect(x).toBe(expectedValue)`
- `expect(mock).toHaveBeenCalled()` without args → change to `toHaveBeenCalledWith(...)`

### Mock review

- For each `vi.mock()`: is it necessary? Can the real implementation be used instead?
- Count total mocks. If >5, question whether the test approach is right.

### Element queries

- All queries should use `data-testid`. Verify every test ID exists in the component.

The first pass of tests is never perfect. Having a second command for reviewing test quality catches things the initial pass misses — weak assertions especially. Across sessions using these commands, branch coverage consistently landed between 90% and 100%.
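The assertion-strength audit lends itself to a rough automated pre-pass. A minimal sketch in TypeScript, assuming the test source is available as a string (the regexes and the `findWeakAssertions` name are illustrative, not part of the real command):

```ts
// Flag weak assertions in test source: bare existence checks and
// argument-less mock call checks. A regex heuristic, not a real parser.
function findWeakAssertions(source: string): string[] {
  const findings: string[] = [];
  for (const [i, line] of source.split("\n").entries()) {
    // expect(x).toBeDefined() with no value check on the same line
    if (/expect\([^)]*\)\.toBeDefined\(\)/.test(line)) {
      findings.push(`line ${i + 1}: toBeDefined() without a value check`);
    }
    // toHaveBeenCalled() instead of toHaveBeenCalledWith(...)
    // (the trailing "()" keeps this from matching toHaveBeenCalledWith)
    if (/\.toHaveBeenCalled\(\)/.test(line)) {
      findings.push(`line ${i + 1}: toHaveBeenCalled() without arguments`);
    }
  }
  return findings;
}

const sample = `
expect(cookie).toBeDefined()
expect(mockSubmit).toHaveBeenCalled()
expect(mockSubmit).toHaveBeenCalledWith({ method: 'POST' })
`;
console.log(findWeakAssertions(sample));
```

Even as a crude lint, a pass like this surfaces the two most common weak-assertion patterns before the review command runs.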


Lessons from the mistakes

AI makes predictable mistakes. Once you know them, you can prevent most of them by adding rules to your commands. Each mistake below led directly to a new rule.

Weak assertions

AI defaults to existence checks. The test passes, but it's not verifying behavior.

```ts
// What AI writes
expect(cookie).toBeDefined()
expect(mockSubmit).toHaveBeenCalled()

// What it should write
expect(cookie).toBe('session=abc123; path=/; secure')
expect(mockSubmit).toHaveBeenCalledWith({ action: '/api/save', method: 'POST' })
```

The rule that fixed this: "Every assertion must verify the correct value, not just existence."

Guessing instead of reading

AI guessed `data-testid` values instead of reading the component to find the actual ones. It would generate something like `getByTestId('submit-button')` when the component actually had `data-testid="form-submit-action"`. The fix: "Don't guess. Only try something twice. If it doesn't work, stop and say you're guessing." That single rule cut down wasted iterations more than anything else.
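The "read before writing" step can even be partially automated. A small sketch of the extraction, assuming components use static `data-testid` string attributes (`extractTestIds` is a hypothetical helper, not from the actual setup):

```ts
// Extract every data-testid from a component's source so tests can
// reference real IDs instead of guessed ones. Regex-based sketch;
// assumes static string attributes, not dynamically built IDs.
function extractTestIds(componentSource: string): string[] {
  const ids = new Set<string>();
  const re = /data-testid\s*=\s*["']([^"']+)["']/g;
  let match: RegExpExecArray | null;
  while ((match = re.exec(componentSource)) !== null) {
    ids.add(match[1]);
  }
  return [...ids];
}

const source = `
  <button data-testid="form-submit-action" type="submit">Save</button>
  <span data-testid="form-error-message" />
`;
console.log(extractTestIds(source)); // ["form-submit-action", "form-error-message"]
```

Pasting a list like this into the session gives AI the ground truth up front instead of letting it invent IDs.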

Over-mocking

One session had ~20 mocks for a single component:

```ts
vi.mock('@remix-run/react')
vi.mock('~/hooks/usePermissions')
vi.mock('~/hooks/useVersion')
vi.mock('~/hooks/useToast')
vi.mock('~/helpers/cookies')
vi.mock('~/helpers/formatDate')
vi.mock('~/helpers/validation')
// ... 13 more
```

At that point you're testing your mock setup, not the component. The shift to "start with zero mocks, justify each one" changed the quality of tests completely. Pulling logic into pure helper functions — input in, output out, zero mocks needed — made this even easier.
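As an illustration of the pure-helper approach, a sketch of the kind of function that needs no mocks at all (`Item` and `filterActiveItems` are invented names):

```ts
// A pure helper: input in, output out. No hooks, no router context,
// no cookies, so the test needs zero mocks.
interface Item {
  name: string;
  archived: boolean;
}

function filterActiveItems(items: Item[]): Item[] {
  return items.filter((item) => !item.archived);
}

const items: Item[] = [
  { name: "draft", archived: false },
  { name: "old report", archived: true },
];

console.log(filterActiveItems(items)); // [{ name: "draft", archived: false }]
```

The test for this is a one-liner with real data. The component test then only needs to check that the helper's output is rendered, which shrinks the mock list on its own.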


The prompts that matter

Effective:

- "Can you write tests until the coverage is 100%?" — measurable target, AI knows when it's done.
- "Why do we need these mocks?" — challenges assumptions, drives minimal mocking.
- Pasting exact error output and asking for 3 possible explanations before fixing — forces thinking over guessing.
- Scoped requests like "This function now filters archived items, can you add a unit test for that?" — specific and actionable.

Ineffective:

- "Update the tests to the changes made." — too vague. Which changes?
- Letting AI figure out API behavior on its own — it guesses. Provide docs or tell it to ask.

A note on data-testid

Testing Library's guiding principle is to query the way users do, which favors `getByRole` over `data-testid`. That's a good default. But in a codebase with i18n, `getByText` breaks every time translations change. And `getByRole` doesn't always map cleanly to custom components. For this project, `data-testid` turned out to be the most reliable selector — translation-independent, refactoring-resilient, and explicit. It's a tradeoff, not a universal recommendation.
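A toy illustration of that fragility, with invented markup and string lookups standing in for real DOM queries:

```ts
// Text-based lookup breaks when a translation changes;
// a data-testid lookup does not. Markup and strings are invented.
function render(locale: Record<string, string>): string {
  return `<button data-testid="form-submit-action">${locale.submit}</button>`;
}

const hasText = (html: string, text: string) => html.includes(`>${text}<`);
const hasTestId = (html: string, id: string) =>
  html.includes(`data-testid="${id}"`);

const v1 = render({ submit: "Save" });
const v2 = render({ submit: "Save changes" }); // translation updated

console.log(hasText(v1, "Save"), hasText(v2, "Save"));   // true false
console.log(hasTestId(v1, "form-submit-action"),
            hasTestId(v2, "form-submit-action"));        // true true
```

The same copy change that breaks the text query leaves the test-ID query untouched, which is the whole argument in miniature.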


The takeaway

AI doesn't write good tests by default. It writes tests that pass. Those are different things. But with the right structure — custom commands, a quality gate, coverage targets — it becomes a genuinely powerful testing partner. The approach works because it treats AI like a capable but overconfident junior developer: clear rules, measurable goals, and a review process.

When something goes wrong, don't just fix it — write it into the commands so it never happens again. That loop — mistake, rule, better tests — compounds over time.