LLMs are Probabilistic, Until They Aren’t

Apparently, Anthropic’s agent.md FrontMatter is case-sensitive. Yep. If you use the tools key, every tool name must be capitalized like a proper noun: Bash and Grep work, bash and grep do not. How did I discover this?

I built two agents to check something I’m working on. It was important they didn’t hallucinate, so I used Anthropic’s guidance on avoiding hallucinations. I spun them up and had them generate a report, which looked great. I then fed that report to another agent tasked with fixing the issues raised, and it came back with a surprising conclusion: every single issue raised by the agents was in fact a hallucination. Sixteen issues, sixteen line numbers from code, sixteen severity indications. All a lie.

I’ve had LLMs hallucinate before; we all have. I’ve never had anything like this. It was catastrophically bad. I checked my context window, and it was fine, well under 50%. I checked the agent markdown, and all the guardrails were there. I asked the main LLM thread what went wrong, and it suggested that the tool names in the FrontMatter had the wrong case. I thought that was absurd. FrontMatter doesn’t have strict formatting for its values, and all of these were command-line tools, whose names are almost universally lower-case. Besides, LLMs are famously probabilistic; surely they’re able to infer that bash is the same as Bash. I pushed back on that as the root cause; the fabrication was so severe that my confidence in the LLM was at an all-time low.

I spent 15 minutes or so working with the LLM, iterating through different agent configurations, all of which produced the same 100% fabricated results. Turns out it was the tool name case. Once that change was made, a new report was generated with actual citations and actual line numbers in the code referenced, and it was validated as real.
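Since the failure was silent, a small lint over the agent file might have caught it before any report was generated. What follows is only a sketch under assumptions, not Anthropic’s own validation: it assumes the tools value is a single comma-separated line, and the only tool names it knows are the two confirmed above, Bash and Grep. The names KNOWN_TOOLS, parse_tools, and lint_tools are made up for illustration.

```python
import re

# Assumption: only Bash and Grep are confirmed in this post. Extend this set
# with the tool names your agent runtime actually documents.
KNOWN_TOOLS = {"Bash", "Grep"}


def parse_tools(agent_markdown: str) -> list:
    """Return the names listed under `tools:` in the leading FrontMatter block.

    Assumes a single-line, comma-separated value such as `tools: Bash, Grep`.
    """
    match = re.match(r"\A---\s*\n(.*?)\n---\s*(?:\n|\Z)", agent_markdown, re.DOTALL)
    if not match:
        return []
    for line in match.group(1).splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "tools":
            return [name.strip() for name in value.split(",") if name.strip()]
    return []


def lint_tools(agent_markdown: str) -> list:
    """Return one warning per tool name that is not an exact, case-sensitive match."""
    by_lower = {name.lower(): name for name in KNOWN_TOOLS}
    warnings = []
    for name in parse_tools(agent_markdown):
        if name in KNOWN_TOOLS:
            continue
        suggestion = by_lower.get(name.lower())
        if suggestion:
            warnings.append(f"'{name}' should be written '{suggestion}'")
        else:
            warnings.append(f"'{name}' is not a known tool name")
    return warnings
```

Running lint_tools on an agent file whose FrontMatter contains `tools: bash, Grep` would flag only bash and suggest Bash. An empty list means every listed name matched exactly.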

It’s a curiously restrictive requirement, and a dangerous one because it failed confidently. I was fortunate that the results were always intended to be reviewed; had there been no review step, the outcome could have been disastrous. At the risk of adding another turtle,[1] I think I might create a peer review agent with the sole purpose of checking reports like this for erroneous citations and files.
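The cheapest part of that review doesn’t even need an LLM. As a rough sketch, assuming the report’s citations have already been extracted into (path, line number) pairs, a few lines of Python can confirm that each cited file exists and that the cited line number falls inside it. The function names and the pair format are hypothetical, chosen only for this example.

```python
from pathlib import Path
from typing import Optional


def check_citation(root: Path, relative_path: str, line_number: int) -> Optional[str]:
    """Return a description of the problem, or None if the cited line exists."""
    target = root / relative_path
    if not target.is_file():
        return f"{relative_path}: file does not exist"
    # Count lines without loading the whole file into memory.
    with target.open(encoding="utf-8", errors="replace") as handle:
        line_count = sum(1 for _ in handle)
    if not 1 <= line_number <= line_count:
        return f"{relative_path}:{line_number}: out of range (file has {line_count} lines)"
    return None


def check_report(root: Path, citations: list) -> list:
    """Check every (relative_path, line_number) pair and collect the problems."""
    problems = []
    for relative_path, line_number in citations:
        problem = check_citation(root, relative_path, line_number)
        if problem is not None:
            problems.append(problem)
    return problems
```

This only proves that a citation points at something real. It says nothing about whether the issue described at that line is real, so it would sit in front of a review step rather than replace one.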