
Nobody reads the confidence score until it's wrong

reflection · 3 Dec 2025
trust · ai-recommendations · decision-making

We ran the Decision Fatigue experiment expecting to learn about cognitive load. We did — but the thing that stuck with us was about trust.

When the AI recommendation came with a 92% confidence score, people followed it without thinking. When it came with a 68% score, they ignored it completely. There was almost nothing in between. No nuance. No “well, 68% is still better than a coin flip.” It was binary: high confidence meant trust, medium confidence meant distrust.

The interesting part: when a high-confidence recommendation turned out to be wrong, trust collapsed — not just for that recommendation, but for the next five. One bad call at 92% did more damage than ten mediocre calls at 70%.

This has implications for everything we build. The temptation is to only surface high-confidence recommendations. But that trains people to stop thinking critically about AI output. And then the one time it’s wrong, the fallout is worse.

We don’t have the answer yet. But we’re starting to think the confidence score itself might be the wrong interface. Maybe the system shouldn’t say “I’m 92% sure.” Maybe it should say “here’s what I’d do and here’s what I’m not sure about.” Show the reasoning, not the number.
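As a rough illustration of the two interfaces, here is a minimal sketch. Everything in it is hypothetical — the `Recommendation` class and both rendering methods are ours, not from any real system; the point is only the contrast between leading with a number and leading with the reasoning.

```python
from dataclasses import dataclass, field

@dataclass
class Recommendation:
    action: str
    confidence: float                              # model's probability estimate, 0..1
    uncertainties: list[str] = field(default_factory=list)

    def as_score(self) -> str:
        # The status quo: lead with the number and let the user binarize it.
        return f"{self.action} (confidence: {self.confidence:.0%})"

    def as_reasoning(self) -> str:
        # The alternative: lead with the action, surface what the model is unsure about.
        caveats = "; ".join(self.uncertainties) or "no major caveats"
        return f"Suggested: {self.action}. Unsure about: {caveats}."

rec = Recommendation("reschedule the shipment", 0.68,
                     ["carrier delay data is two days old"])
print(rec.as_score())      # prints: reschedule the shipment (confidence: 68%)
print(rec.as_reasoning())  # prints: Suggested: reschedule the shipment. Unsure about: carrier delay data is two days old.
```

The second rendering gives the user something to evaluate rather than a threshold to apply, which is exactly the shift we want to test.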

Something to test.
