LLMs are Probabilistic, Until They Aren’t

Apparently, Anthropic’s agent.md FrontMatter is case-sensitive. Yep. If you use the tools key, all of your tools must be written as proper nouns: Bash and Grep work, bash and grep do not. How did I discover this?

I built two agents to check something I’m working on. It was important they didn’t hallucinate, so I used Anthropic’s guidance on avoiding hallucinations. I spun them up and had them generate a report, which looked great. I fed that to another agent tasked with fixing the issues raised, and it came back with a surprising conclusion: every single issue raised by the agents was in fact a hallucination. Sixteen issues, sixteen line numbers from code, sixteen severity indications. All a lie.

I’ve had LLMs hallucinate before; we all have. I’ve never had anything like this. It was catastrophically bad. I checked my context window, and it was fine, well under 50%. I checked the agent markdown, and all the guardrails were there. I asked the main LLM thread what went wrong, and it suggested that the tools in the FrontMatter had improper case. I thought that was absurd. FrontMatter doesn’t have strict formatting for its values, and all these tools were command line tools, which are almost universally lowercase. Besides, LLMs are famously probabilistic; surely they’re able to infer that bash is the same as Bash. I pushed back on that as the root cause; the fabrication was so severe that my confidence in the LLM was at an all-time low.

I spent 15 minutes or so working with the LLM, iterating through different configurations on the agent, all with the same 100% fabricated results. Turns out it was the tool name case. Once that change was made, a new report came back with actual citations and actual line numbers in the referenced code, all validated as real.
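A cheap guard against this is to lint the agent file before running it. Here is a minimal Python sketch, assuming the front matter sits between `---` lines and that `tools` is a single comma-separated line. The function name `find_miscased_tools` and the sample agent are made up for illustration, and the only rule it encodes is the one I observed: tool names start with a capital letter.

```python
import re


def find_miscased_tools(agent_markdown: str) -> list[str]:
    """Return tool names under the `tools` key that do not start with a capital letter."""
    lines = agent_markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return []  # no front matter block at the top of the file

    suspects = []
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of the front matter block
        match = re.match(r"^tools\s*:\s*(.*)$", line)
        if match:
            # Assumption: tools are listed on one line, separated by commas.
            names = [name.strip() for name in match.group(1).split(",") if name.strip()]
            suspects.extend(name for name in names if not name[0].isupper())
    return suspects


# Hypothetical agent file; the name and body are placeholders.
EXAMPLE_AGENT = """---
name: report-checker
tools: Bash, grep
---
Check the report and cite real line numbers.
"""

print(find_miscased_tools(EXAMPLE_AGENT))
```

This only flags names that look wrong by case. It does not know which tools actually exist, so treat a clean result as "nothing obviously wrong", not as proof the agent has its tools.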

It’s a curiously restrictive requirement and dangerous in that it failed confidently. I was fortunate that the results were always intended to be reviewed, but if its outputs had no review step, it could have been disastrous. At the risk of adding another turtle, I think I might create a peer review agent with the sole purpose of checking reports like this for erroneous citations and files.
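Part of that review does not need an LLM at all. Below is a rough Python sketch of the mechanical half, assuming the report cites code as `path:line` (for example `src/app.py:12`). That format, the function name `find_bad_citations`, and the sample files are all my assumptions for illustration. It only checks that each cited file exists and that the line number falls inside it; whether the issue described is real still needs a reviewer.

```python
import re
import tempfile
from pathlib import Path

# Assumed citation format: a relative path with an extension, a colon, a line number.
CITATION = re.compile(r"(?P<path>[\w./-]+\.\w+):(?P<line>\d+)")


def find_bad_citations(report: str, repo_root: Path) -> list[str]:
    """Return citations whose file is missing or whose line number is out of range."""
    bad = []
    for match in CITATION.finditer(report):
        path = repo_root / match.group("path")
        line_no = int(match.group("line"))
        if not path.is_file():
            bad.append(f"{match.group(0)} (file not found)")
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        line_count = len(text.splitlines())
        if not 1 <= line_no <= line_count:
            bad.append(f"{match.group(0)} (file has {line_count} lines)")
    return bad


# Tiny demo against a throwaway directory standing in for a repository.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "src").mkdir()
    (root / "src" / "app.py").write_text("a = 1\nb = 2\nc = 3\n", encoding="utf-8")
    report = (
        "High: off-by-one in src/app.py:2\n"
        "Low: unused import in src/app.py:40\n"
        "High: race condition in src/missing.py:7\n"
    )
    for problem in find_bad_citations(report, root):
        print(problem)
```

A check like this would have caught my sixteen fabricated issues only if the invented files or line numbers were out of range, so it is a floor, not a replacement for the peer review agent.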

Pop Culture Product Laws

I find a lot of inspiration for my work in weird places. This is a compilation of those oddities as weird laws.

Malcolm’s Law I

Your scientists were so preoccupied with whether they could, they didn't stop to think if they should

Just because you can build a thing doesn’t mean you should build the thing. Effective and early UX design provides the Should.

Malcolm’s Law II

Life, uh, finds a way

Users, uh, find a way. Pave the cowpaths.

The AI Endgame Looks a Lot Like Linux vs Windows/MacOS

If you hang out in the AI/LLM sphere for a while, you’ll inevitably stumble on the token gap. AI providers like Anthropic, Google, OpenAI, et al., are selling you tokens at a significant loss. This is, of course, nothing new. Amazon was famously accused of new math when they perpetually lost hundreds of millions of dollars a year before they didn’t. Uber was significantly cheaper than a taxi until they weren’t. There’s a general sense that the same pattern will play out with AI, and costs will skyrocket when one of the competitors “wins”. So there’s a big push for local models amongst power users. If you have the time, money, and an electrician, you can get close to Sonnet 4.5 quality with a home rig.

I think history will repeat, but a different branch of history than some are predicting. Let’s start with Google and Gemini. After getting caught flat-footed, Google has quickly closed the gap with their Gemini 3 models. We can debate where they fall relative to frontier models from OpenAI and Anthropic, but they weren’t even part of the conversation last year. So much so that Apple has selected Google’s AI as the basis of Apple’s Foundation Models. Google and Apple can subsidize the token deficit effectively indefinitely by amortizing the costs across their customer base. Take Apple One, Apple’s “everything” subscription. I pay $40 a month for it. I don’t use $40 a month worth of the service, but it would cost me more than $40 a month for the services I do regularly use, and I get the benefit of easily dipping in and out of the other offerings. It’s like insurance; the user base all pays in, and lighter users subsidize heavier users.

OpenAI and Anthropic can’t indefinitely lose money every quarter, and unlike the examples of Amazon and Uber, the landscape is different this time around. Amazon didn’t have a Google to compete against. Google, Apple, and Microsoft can also amortize the cost across additional revenue streams. They don’t have to win; they just have to wait.

So to me, the future looks a lot like the present. Certain technically inclined folks will take the time and effort to set up local LLM machines, just like they have with Linux. They will have far more flexibility and far fewer constraints, and it will be cheaper. Most people, including most other technically inclined people, will be subscribed to Google, Apple, and Microsoft, with the slim possibility of a fourth competitor emerging.

Building in Public

So this is my new site. I haven’t made a personal site in years. I always get stuck. I’m going to get stuck again, but two things are rattling around in my head. First, in my hubris, I think I have interesting things to say and need a place to put them. So they’re going here. The second is Gall’s Law. I discovered this one during my time working on the AstroUX Design System. Gall’s Law states: <blockquote>A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over with a working simple system.</blockquote> While I have complex ideas for this site, I can’t start there. So I’m starting as simple as possible, just like the very first web page I ever wrote back in … I don’t know, 1993, maybe ’94: back before CSS and JavaScript were even concepts; back before I even had the PPP internet connection required for using a GUI browser like Mosaic; back before there were even mobile devices capable of displaying web pages, let alone responsive design. This sucker is going to be one HTML page rendered in whatever default state your browser shows it in.

That’s it. That’s the site. For now.