The Hypothesis

When does an AI agent with travelling salesman skills stop reasoning and start hallucinating?

The Concept

Give an AI agent optimisation skills, such as solving the travelling salesman problem (TSP), and point it at a real-world scheduling challenge: aged care rostering. As the number of carers, clients, constraints, and preferences grows, the decision tree expands exponentially. This experiment tests where the boundary lies: at what point does the agent stop making sound decisions and start confabulating plausible-looking but broken schedules?

Decision tree collapse

How it works

Small roster (5 carers)
Agent solves optimally — TSP + constraints + preferences
Scale up complexity — 10, 20, 50, 100 carers
Add real-world chaos — last-minute cancellations, preference shifts
Measure output quality — valid vs. hallucinated schedules

Complexity is scaled up progressively until the agent’s reasoning breaks down, and the resulting failure patterns are then mapped.
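The scaling steps above could be driven by a small harness along these lines. This is a minimal sketch, not the experiment's actual code: `solve`, `is_valid`, `validity_by_size`, `collapse_point`, the trial count, and the 0.95 threshold are all hypothetical names and values chosen for illustration. Only the roster sizes come from the steps listed above.

```python
from typing import Callable, Dict, Optional, Sequence

# Carer counts taken from the steps above.
SIZES = (5, 10, 20, 50, 100)


def validity_by_size(
    solve: Callable[[int, int], object],
    is_valid: Callable[[int, object], bool],
    sizes: Sequence[int] = SIZES,
    trials: int = 20,
) -> Dict[int, float]:
    """Return the fraction of valid schedules produced at each roster size.

    solve(n_carers, seed) stands in for the agent call (hypothetical).
    is_valid(n_carers, schedule) stands in for an independent checker (hypothetical).
    """
    rates: Dict[int, float] = {}
    for n in sizes:
        valid = sum(1 for seed in range(trials) if is_valid(n, solve(n, seed)))
        rates[n] = valid / trials
    return rates


def collapse_point(rates: Dict[int, float], threshold: float = 0.95) -> Optional[int]:
    """Return the smallest size whose validity rate falls below the threshold.

    Returns None when every measured size stays at or above the threshold.
    The threshold value is an assumption, not a figure from the experiment.
    """
    for n in sorted(rates):
        if rates[n] < threshold:
            return n
    return None
```

In use, `solve` would wrap the agent and `is_valid` would wrap a checker that does not share code or context with the agent, so that the measured rate reflects real constraint satisfaction rather than the agent's own report.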

What it explores

At what scale does the agent start producing schedules that look valid but contain constraint violations?
Can the agent recognise when it’s out of its depth and ask for help instead of guessing?
Does breaking the problem into sub-problems (clustering by region first) delay the collapse point?
What are the early warning signs that an agent is about to hallucinate a solution?
Is there a reliable way to validate agent output without a human reviewing every schedule?

What we found

Agent handled up to ~30 carers with high accuracy
Between 30–50, subtle errors crept in — double-bookings hidden inside otherwise valid schedules, physically impossible travel times
Beyond 50, the agent produced schedules that violated multiple hard constraints while still reporting high confidence scores
Failure mode was plausible-looking nonsense, not random errors
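The two subtle errors named above, double-bookings and physically impossible travel times, are the kind of thing a deterministic check can catch. The following is a minimal sketch under stated assumptions: the `Visit` record, the `find_violations` function, and the `travel_minutes` lookup are hypothetical, times are minutes since midnight, and a missing travel entry is treated as zero minutes (which is lenient).

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Visit:
    carer: str
    client: str
    start: int  # minutes since midnight
    end: int    # minutes since midnight


def find_violations(
    visits: List[Visit],
    travel_minutes: Dict[Tuple[str, str], int],
) -> List[str]:
    """Return descriptions of hard-constraint violations; empty means none found."""
    by_carer: Dict[str, List[Visit]] = {}
    for v in visits:
        by_carer.setdefault(v.carer, []).append(v)

    violations: List[str] = []
    for carer, day in by_carer.items():
        ordered = sorted(day, key=lambda v: (v.start, v.end))
        latest = ordered[0]  # visit with the latest end time seen so far
        for nxt in ordered[1:]:
            if nxt.start < latest.end:
                # The carer would need to be in two places at once.
                violations.append(
                    f"{carer}: double-booked ({latest.client} overlaps {nxt.client})"
                )
            else:
                # Assumption: unknown client pairs need zero travel time.
                needed = travel_minutes.get((latest.client, nxt.client), 0)
                gap = nxt.start - latest.end
                if gap < needed:
                    violations.append(
                        f"{carer}: travel {latest.client}->{nxt.client} "
                        f"needs {needed} min, only {gap} min available"
                    )
            if nxt.end >= latest.end:
                latest = nxt
    return violations
```

Tracking the latest end time, rather than comparing only adjacent visits, means a long visit that overlaps several later ones is still reported against each of them.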

Learnings

Agents don’t degrade gracefully — they cross a threshold and start producing confidently wrong output, with no warning from their own confidence scores.
Constraint validation must be a separate verification layer — never trust the same system to both generate and validate.
Breaking combinatorial problems into geographic clusters delays the collapse point by 3–4x — decomposition is the primary scaling strategy.
The most dangerous failure mode is 95% correct output — the 5% is invisible without automated checks, and humans assume the rest is fine.
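The decomposition idea in the learnings above can be illustrated with a short sketch. It is not Wayfinder's implementation: it assumes clients have planar coordinates in kilometres, buckets them into fixed grid cells as a stand-in for geographic clustering, and orders each cluster with a greedy nearest-neighbour heuristic, which is not an optimal TSP tour. All names (`cluster_by_grid`, `nearest_neighbour_route`, `plan_by_cluster`, `cell_km`) are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (x_km, y_km) in a hypothetical local grid
Cell = Tuple[int, int]


def cluster_by_grid(clients: Dict[str, Point], cell_km: float) -> Dict[Cell, List[str]]:
    """Group client names by the grid cell their coordinates fall into."""
    clusters: Dict[Cell, List[str]] = defaultdict(list)
    for name, (x, y) in clients.items():
        cell = (math.floor(x / cell_km), math.floor(y / cell_km))
        clusters[cell].append(name)
    return dict(clusters)


def nearest_neighbour_route(names: List[str], clients: Dict[str, Point]) -> List[str]:
    """Greedy visiting order within one cluster (a heuristic, not optimal)."""
    if not names:
        return []
    remaining = sorted(names)  # deterministic starting point
    route = [remaining.pop(0)]
    while remaining:
        here = clients[route[-1]]
        nxt = min(remaining, key=lambda n: math.dist(here, clients[n]))
        remaining.remove(nxt)
        route.append(nxt)
    return route


def plan_by_cluster(clients: Dict[str, Point], cell_km: float = 5.0) -> Dict[Cell, List[str]]:
    """Solve each geographic cluster as its own small sub-problem."""
    return {
        cell: nearest_neighbour_route(members, clients)
        for cell, members in cluster_by_grid(clients, cell_km).items()
    }
```

Each cluster is then a much smaller problem than the full roster, which is the property the decomposition relies on. A fixed grid can split neighbours that sit on either side of a cell boundary, so a real system would likely need a less crude clustering step.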

Where it goes next

This directly shaped Wayfinder’s architecture — geographic clustering, hard constraint verification layers, and an agent that knows when to flag a problem instead of guessing. The collapse point research continues to inform how we build any agent that handles combinatorial problems.