At Mechanical Orchard we use GenAI in day-to-day software development (using tools like Claude Code, Cursor, and so on), and we also produce tools that help you use GenAI to modernise mainframe workloads. We also care deeply about code quality and maintainability. Most of us come from an XP background. We practice TDD and pair-programming. As you might imagine, there’s a lot of internal conversation about the role we want GenAI to play in a sustainable software development life cycle.
This blog post was originally a part of that internal conversation. I wanted to explore some of the tensions around how trustworthy LLM-generated code can be. What does it even mean when software developers trust a piece of code?
LLMs can be used as guides, as enthusiastic-but-inexperienced interns, as compilers, and many other things.
One popular perspective is the “LLM as a compiler” view. In this view:
The LLM allows me to focus on a higher level of abstraction than the code.
This means I can be way more productive.
This has come up in articles like Yegge’s Gastown blog:
I’ve never seen the code, and I never care to, which might give you pause. ‘Course, I’ve never looked at Beads[1] either, and it’s 225k lines of Go code that tens of thousands of people are using every day.
And like Breunig’s codeless-library blog:
the whenwords library contains no code. Instead, whenwords contains specs and tests
These usages of LLMs require us to trust the output of the LLM without having a human read and understand it. This is controversial.
Let’s discuss where our trust comes from, and how that works when we software developers start treating LLMs as compilers, or as trusted co-workers.
I am not going to talk about other uses of LLMs which require less trust.
There are loads of ways to get value out of LLMs even if you barely trust them at all. If I’ve finished writing this other blog by the time you read this one, then you might find some of them there. If not, I recommend LLM use at Oxide for a good nuanced view.

I am not going to talk about maintainability, even though it’s a closely related issue.
There’s a significant and valid fear that LLM-generated code speeds us up today at the expense of slowing us down tomorrow when we have to maintain the code the LLM produced.
That is a massive and subtle issue that deserves its own treatment.

I am not going to talk about prompt injection or tool usage, even though that’s a closely related issue.
The possibility of the LLM secretly stealing all your sensitive data, or accidentally wiping your disk is a real problem. And it will have a bearing on your feelings of trust. But it’s not about the quality of the code the LLM produces. We can address prompt injection and tool usage with sandboxing techniques. We cannot address code quality that way.
I trust the code my friends write.
This is what LLMs are competing with, and this may be a big part of why this is controversial.
At MO, we have these code quality gates:
We only hire people who can write code I want to trust
We pair on — or at least review — all our code
We test.
A naive use of LLMs to generate code breaks all three of these gates.
We have not interviewed the LLM
…and we kinda can’t. Partly because it’s an alien, and all our interview practices are very human-focussed, but also significantly because the LLM keeps changing under our feet. It’s a software service with no declared API and no changelog. Although to be fair, at the moment that largely manifests as “it keeps getting better in every measurable way”, and maybe that’ll continue 🤷♀️

If we’re not inspecting the code the LLM produces, then no one is pairing on it.
If we’re not inspecting the tests the LLM produces, then is it really testing?
If we want to trust the LLM in the same way we trust each other, then I think a good way to measure it is to pair with the LLM as if it were a human. I’ve spent a lot of time actively pairing with Anthropic’s Claude as if it were a human colleague[2], and I do not currently trust it without close supervision.
That’s not to say that there aren’t other ways I can trust an LLM. I’m just saying that I don’t trust it the same way I trust you.
I trust the output of my compiler
I think this is closer to the kind of trust that Yegge has in his Beads and Gastown tools.
It’s easy to get distracted by determinism here, but if you think about it, are you actually sure that GCC or javac are 100% deterministic?
I’m pretty sure that most JavaScript engines are deliberately not deterministic. I know that (at least 15 years ago) JIT optimisations in V8 and SpiderMonkey meant that the code of a given loop body was pretty much always different after a few iterations. I suspect these days it’s also different depending on what hardware your browser can access, and what load that hardware is under.
I wouldn’t be surprised if GCC’s best-optimised code had a bit of randomness injected too.
Compilers are wild y’all.
I think we actually trust compilers because they’ve done what we want a whole bunch of times, and haven’t done anything bad in a really long time.
The Kate Test
My friend Kate Spinney is an engineer who works here at MO on both InfoSec and Infrastructure[3]. She said this really well this week. To paraphrase:
I’ll be happy “not looking at the code” the LLM writes when I’ve previously had the experience of spending 5 hours going through a whole bunch of code it wrote previously, and:
I’ve not wanted to rip any of it out
I’ve not wanted to massively change any of it
I love this because it’s measurable, and it’s plausible. It’s not “an argument for why I can never trust an LLM like I trust a compiler”. I think it’s likely that LLMs will get to this point, even if they’re not there right now.
I trust it because I saw it running
This may be more or less why we trust libraries and tools we download from the internet. We don’t know the people who wrote those things. We haven’t read the code. We tried it on a few examples, and it seemed cool. It looks like loads of other people tried it on their examples too, and it worked for them.
Yegge specifically calls this out in his blog post:
‘Course, I’ve never looked at Beads either, and it’s 225k lines of Go code that tens of thousands of people are using every day.
If tens of thousands of people have run the thing and not had significant issues, it’s probably cool. Given enough eyeballs, all bugs are shallow.
There’s an open question as to how much we need to see it running before we’re convinced to trust it.
For me personally, I’d trust Yegge’s Beads about as much as I’d trust most other OSS projects, not because the LLM is good, but because of those tens of thousands of eyeballs. I’d trust something an LLM produced just for me significantly less, because I’m the first person to run the thing.
…I saw it running loads because tests.
If I have tests that I already trust, I can certainly use them to gain significant amounts of trust in the code the LLM gave me. However, now we have to figure out:
Do we trust the tests?
If humans we trust wrote them, sure. If an LLM wrote them, maybe? See below…

Do the tests cover everything relevant?
We’re used to testing first, so we usually get to assume that the tests are a good reflection of the code. LLMs do not TDD, even when they claim to. Even when presented with a well written set of tests and asked to produce minimal code which passes those tests and does nothing else, LLMs will sometimes add handy extra features just in case.
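To make that concrete, here’s a minimal sketch of what “trusted tests gating LLM output” can look like. Everything here is hypothetical: `slugify` stands in for some function we asked an LLM to write, the behaviour tests are ours (written before the LLM sees the task), and the surface check is one crude way to catch those “handy extra features”.

```python
import re

# --- pretend this function body came back from the LLM ---
def slugify(title: str) -> str:
    """Lowercase, replace runs of non-alphanumerics with single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

# --- trusted tests, written by humans first ---
def test_behaviour():
    assert slugify("Hello, World!") == "hello-world"
    assert slugify("  spaces   everywhere ") == "spaces-everywhere"
    assert slugify("already-slugged") == "already-slugged"

def test_no_extra_surface():
    # Guard against unrequested features: the module should export
    # exactly the functions we asked for, and nothing more.
    import sys
    import types
    module = sys.modules[__name__]
    public = {name for name, value in vars(module).items()
              if not name.startswith("_")
              and not name.startswith("test_")
              and isinstance(value, types.FunctionType)}
    assert public == {"slugify"}

test_behaviour()
test_no_extra_surface()
```

The surface check only catches extra top-level functions, not extra behaviour hidden inside the one function you asked for, so it’s a complement to the behaviour tests, not a substitute.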
I trust it because another LLM read it
I’ve certainly gotten a lot of mileage[4] out of asking an LLM to explain its work to me after it wrote the thing.
I believe it’s possible to build a technique using multiple LLMs and multiple passes of “read this”, “improve it”, “simplify it”, “is there anything we can refactor?”, and so on.
I have not yet managed to build this technique/tool out to my satisfaction. If I ever do manage it, I hope that the resulting code will pass the Kate test.
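The loop itself is trivial to sketch; the hard parts are the prompts and the judgement about when to stop. Here’s a minimal, hypothetical skeleton of the multi-pass idea. `ask_llm` is a deliberate placeholder that just echoes its input so the sketch runs; a real version would call whatever model API you use (and probably a different model per pass).

```python
# Hypothetical multi-pass review loop: feed the artefact through several
# review prompts, with each pass's output becoming the next pass's input.
REVIEW_PASSES = [
    "Read this code and explain what it does.",
    "Improve it: fix any bugs you found.",
    "Simplify it: remove anything unnecessary.",
    "Is there anything we can refactor? Apply it.",
]

def ask_llm(prompt: str, code: str) -> str:
    # Placeholder so the sketch is runnable: a real implementation
    # would send `prompt` and `code` to a model and return its answer.
    return code

def multi_pass_review(code: str) -> str:
    for prompt in REVIEW_PASSES:
        code = ask_llm(prompt, code)
    return code
```

The interesting open questions all live outside this skeleton: how you verify that each pass preserved behaviour (trusted tests again), and how you stop the “improve” passes from adding the very features you were trying to strip out.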
I don’t need to trust it as much as you think
Everything above has implicitly assumed that we need to trust our LLM-generated code as much as we trust our human-generated code. But that might not be the case for much longer.
So what if we produce code that humans can’t read? It won’t pass the Kate test, but maybe all we care about is that we don’t get sued?
Here are some things we definitely still care about:
The code should be acceptably free of exploits
…and acceptably free of bugs
Here are some things we may no longer care about:
Readability. So long as LLMs can read and change it, that’s fine.
Performance. Computers are fast.
Dead code. Disks and memory are big.
Aside: notice that these are all the same things that early compiler advocates used to say.
Early compilers produced code that was hard to read, slow, and full of dead code. The assembly produced by early compilers would not have passed the Kate test.
Even so, we still need to be able to reason about exploits and bugs. If we’re not using trusted tests and trusted humans to ensure our code is exploit-free and bug-free, what are we using?
It’s also worth calling out that if we choose to stop caring about some things, then we have to do it all together. It’s no fun being the only person hand-editing the assembly that everyone else is churning out by the kilobyte with GCC.
Where I’m at right now
I read every line of code my LLM produces.
My best shot so far at something “higher level” has been getting other LLMs to read it, and ensuring that I’ve read and understood at least the tests. I haven’t yet gotten the results I want, even with a mind towards a world in which everyone is using LLMs to interact with this code.
So long as LLMs continue to improve, I believe that a time will come when people with access to them can step up an abstraction level for much of the sort of work I do.
If we want to get there sooner, I think we should not focus on Gastown-like tools and having LLMs produce more code. I think we should focus on LLMs for comprehension, refactoring, and simplification.
By the time you read this however, there’s a good chance that “where I’m at right now” will have changed. Here are some questions I’m actively working to answer right now:
How close can we get to passing the Kate Test with some of the agentic workflows some of us are already using?
Can we even get LLMs to write code that’s indistinguishable from human-written code? Is there value in a software-developer-Turing-test?
Quite aside from the trustworthiness of the code our LLMs output, can we trust the processes they use to output it? Can we effectively prevent an LLM from exfiltrating sensitive information while it’s writing all that beautiful code for us?
Stay tuned.
[1] For more on Yegge’s Beads library, see this introductory blog post.
[2] …and I’ve written about this internally. If that sort of thing is of interest in the wider world, then drop me a comment below, and I’ll see about writing something on the topic here.
[3] The Mechanical Orchard Infrastructure team is a DevOps/SRE-like team, responsible for building and managing all our software infrastructure. “Infrastructure” here can mean things like networks and Kubernetes clusters that run MO services; or it can mean things like the “modernisation in a box” that we provide to customers, for them to run in their own datacentres.
[4] …again, if you want to read about that, let us know in a comment.



