The Illusion of Determinism
A few months ago, I participated in a company-wide hackathon with a group of data engineers. The project was straightforward enough: build tooling to automate some of our data modeling conventions – tasks like auto-generating documentation, scaffolding schema files, and enforcing naming patterns. The kind of stuff that everyone agrees should be standardized but nobody wants to do by hand.
During one of our syncs, one of my teammates pushed back on using LLMs for code generation. And that's when the room split:
- "LLMs are non-deterministic. We can't guarantee the outputs."
- "Humans are non-deterministic too. That's the whole reason we're building this tool."
The first concern is valid. LLMs are indeed probabilistic, and you can run the same prompt twice and get different results. But the second response was harder to dismiss. We were building a tool to enforce consistency that humans had failed to maintain, yet hesitating to use AI for fear it might not be consistent enough.
That contradiction stuck with me, and I think it points to something bigger than the hackathon.
The implicit hierarchy
Much of the skepticism around AI in engineering workflows rests on this assumption:
- Humans → predictable, reliable, consistent. The deterministic baseline.
- AI → probabilistic, unpredictable, risky. The non-deterministic wildcard.
We hold these two systems to wildly different standards. AI's variance gets flagged as "unreliable." The variance baked into our own work goes mostly unexamined.
But look at any sufficiently large codebase. In theory, it should be deterministic. We write schemas, define naming conventions, and create style guides, PR templates, and review processes. We build systems that are supposed to make the "right way" obvious.
In practice, most codebases are an accumulation of local decisions made by different people with different contexts and different interpretations of the same rules. The same business logic gets implemented three different ways depending on which team touched it. Naming conventions drift. Folder structures diverge — not because of a deliberate migration, but because someone started doing it differently and the pattern stuck.
"Deterministic" is a generous description of what humans actually produce.
Broken windows all the way down
The Pragmatic Programmer applies the broken windows theory to software: once visible disorder enters a system, it changes what people believe is acceptable inside that system. An inconsistency becomes a pattern. A shortcut becomes the new convention.
There's empirical support for this in software engineering specifically. In 2022, researchers found that developers working in systems with higher tech debt density were significantly more likely to introduce additional debt themselves (Alfayez et al., 2022). The mess is contagious. Existing disorder doesn't just persist; it reproduces.
And this isn't unique to code. In Noise: A Flaw in Human Judgment, Kahneman et al. found that professional judgment is far noisier than anyone expects – radiologists disagree with their own prior readings of the same X-ray about 20% of the time (!!). These are trained professionals applying structured rules under real constraints, and the variance is still enormous.
We don't skip steps or drift from conventions because we're careless. We do it because humans aren't deterministic systems either. We get tired and overwhelmed. We forget conventions we haven't touched in six months. We bring habits from previous jobs and apply them without realizing it. We copy the nearest file that looks close enough and treat it as the current standard.
The mirror effect
We figured out where this argument lands the hard way. Early in the hackathon, we pointed our tooling at documentation generation: have the LLM read a dbt model and produce a standardized description for each column.
The outputs were all over the place. One run would describe a column as:
"the unique identifier for the order"
the next as:
"primary key referencing the orders table"
the next as:
"order ID"
All of these descriptions were technically correct, but they were inconsistent in exactly the way we were trying to fix. The instinct was to blame the LLM. But when we looked at the existing human-written documentation, it had the same problem: the same inconsistencies, spread across hundreds of files where nobody had noticed.
The naming convention enforcer was worse. We wanted it to flag non-standard column names, but first we had to define what "standard" meant. We had created_at, creation_date, and date_created all living in production, depending on which team had built the model. The LLM couldn't enforce a convention that didn't exist. And neither could we (humans).
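To make the point concrete, here is a minimal Python sketch of what an explicit, checkable naming rule might look like. Everything in it is an assumption for illustration: the essay does not say which spelling became the standard, so treating created_at as canonical is hypothetical, as are the names CANONICAL_NAMES, check_column_name, and check_columns.

```python
import re

# Hypothetical convention: column names are snake_case, and known variant
# spellings map to one canonical name. Picking "created_at" as the canonical
# form is an assumption; the source does not say which spelling won.
CANONICAL_NAMES = {
    "creation_date": "created_at",
    "date_created": "created_at",
}

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def check_column_name(name: str) -> list[str]:
    """Return human-readable violations for a single column name."""
    problems = []
    if not SNAKE_CASE.match(name):
        problems.append(f"{name}: not snake_case")
    if name in CANONICAL_NAMES:
        problems.append(f"{name}: use {CANONICAL_NAMES[name]} instead")
    return problems


def check_columns(names: list[str]) -> list[str]:
    """Check every column and collect violations in input order."""
    return [problem for name in names for problem in check_column_name(name)]
```

Calling check_columns(["created_at", "creation_date", "OrderID"]) would flag the last two names and pass the first. The rule itself is trivial; the hard part, as described above, is agreeing on what goes into the mapping.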
The parts that worked well were the most constrained: scaffolding new schema files from a template, generating test configurations from a fixed set of rules. When the inputs were clear and the expected output was well-defined, the LLM was remarkably consistent. Not identical every time, but consistent in the ways that mattered.
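A rough sketch of what "clear inputs and a well-defined expected output" can mean for scaffolding, written in Python. It is not the hackathon tooling: the function name scaffold_schema, its parameters, and the rule that the key column gets unique and not_null tests are all assumptions, and the YAML layout is a generic dbt-style schema file.

```python
import json


def scaffold_schema(model: str, columns: dict[str, str], key: str) -> str:
    """Render a dbt-style schema YAML string for one model.

    columns maps column name -> description. key names the primary-key
    column, which receives unique and not_null tests. All names and rules
    here are illustrative assumptions.
    """
    if key not in columns:
        raise ValueError(f"key column {key!r} is not in columns")
    lines = [
        "version: 2",
        "",
        "models:",
        f"  - name: {model}",
        "    columns:",
    ]
    # Sort columns so the same inputs always produce byte-identical output.
    for name in sorted(columns):
        lines.append(f"      - name: {name}")
        # json.dumps yields a double-quoted, escaped string that YAML accepts.
        lines.append(f"        description: {json.dumps(columns[name])}")
        if name == key:
            lines.append("        tests:")
            lines.append("          - unique")
            lines.append("          - not_null")
    return "\n".join(lines) + "\n"
```

With a template like this, the only open question left for a model or a person is what goes into the inputs. The shape of the file is no longer up for interpretation, which is one plausible reason the constrained tasks came out consistent.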
The pattern was unmistakable: underspecified conventions produced noisy outputs. Well-specified ones produced clean outputs. The AI wasn't going rogue. It was doing exactly what a new hire would do – generalizing from the examples available, with no way to tell which pattern is deprecated and which is current unless that distinction exists somewhere outside someone's head.
That's the part we don't always want to admit. If your conventions aren't explicit enough for an AI agent to follow, they probably weren't explicit enough for humans either. Humans just hide the ambiguity better. We ask questions, we copy nearby files, we internalize tribal knowledge. An agent doesn't have that social context. So when its output is inconsistent, the instinct is to blame the tool. But often the AI is reflecting the ambiguity right back at you.
What we actually did about it
Once we saw the pattern, we changed course. We stopped trying to build tooling on top of conventions that didn't exist and started writing the conventions themselves – canonical examples, explicit naming rules, documented patterns that lived in code rather than in someone's memory.
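As a hedged illustration of a convention that lives in code rather than in memory, the Python sketch below keeps one agreed description per recurring column and reports where existing documentation drifts from it. The glossary entries and the names CANONICAL_DESCRIPTIONS and find_description_drift are hypothetical, not the standards we actually wrote.

```python
# Hypothetical glossary: one agreed description per recurring column name.
# The wording of each entry is an assumption for illustration.
CANONICAL_DESCRIPTIONS = {
    "order_id": "Unique identifier for the order.",
    "created_at": "Timestamp when the record was created.",
}


def find_description_drift(documented: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Map column name -> (found, expected) where docs differ from the glossary.

    Columns missing from the glossary are skipped: there is no rule to
    check them against.
    """
    drift = {}
    for column, found in documented.items():
        expected = CANONICAL_DESCRIPTIONS.get(column)
        if expected is not None and found.strip() != expected:
            drift[column] = (found, expected)
    return drift
```

The same glossary can be handed to an LLM as context and used to check its output afterwards, so humans and the tool are held to one written standard.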
The result was a tool that actually worked reliably and a set of codified standards that made the whole codebase more navigable independent of the AI. The most lasting output of the hackathon wasn't the tooling. It was the conventions.
The real cost of ambiguity
This reframes how engineering orgs should think about AI adoption. The framing of "deterministic humans vs. non-deterministic AI" is a false dichotomy. It's more like "inconsistently variable humans" vs. "predictably variable AI" — and the variable that matters most is the clarity of the system they're both operating in.
The work required to make a codebase legible to an LLM (e.g., explicit conventions, canonical examples, enforceable patterns, documentation that reflects current practice) is the same work that makes it legible to new hires, to adjacent teams, and to your future selves. It's system legibility, and it compounds in every direction.
Underspecified systems used to be a drag on velocity: annoying, but manageable if you had enough senior engineers who'd internalized the tribal knowledge. But the cost of ambiguity has gone up, because now it's not just slowing down humans; it's actively degrading the output of every AI tool you plug in.
Revisiting the room
I keep thinking back to that moment when the room hesitated. "LLMs are non-deterministic." The concern was real. Nobody wants to introduce unpredictability into systems that other people depend on.
But the concern that stuck with me wasn't whether we could trust the AI. It was that the codebase we were pointing it at had never been consistent enough to trust in the first place.
The problem isn't that AI is non-deterministic. It's that we've overestimated how deterministic we are.