This spun off from a conversation about Generative AI code systems, their impact more broadly, and whether they turn developers into a kind of inverse Mechanical Turk whose job is to validate machine-generated code. The tl;dr of this is: I think generative coding systems accelerate your lowest-expertise developers more than your senior developers, due to low-expertise developers accepting solutions that wouldn't meet the quality bar for more senior developers. Fixing this would require systematically re-evaluating the code review process, and new approaches to making sure that the risks created by the novel ways that generative code intersects with implicit requirements are addressed.
Using Developers as Inverse Mechanical Turks
At the moment, we're seeing big steps forward in the ability of LLMs to generate functional code - I've seen it with my friend's projects, and I've experimented a little with it myself, although measurements of the actual effectiveness / productivity increase are messy at best. These systems can't be entirely closed loop; they require a human in the loop to validate that the code is as functional as it claims to be.
Functional in this case means "it solves the specific problem that the user had at the time". I had a friend rip out some CANBUS functionality for datalogging on a motorcycle, and it did the job with a little prompting and saved a few hours of their time. This is the core possibility that underpins the hype - it's very compelling to see someone quickly solve a problem they have, based on a set of generalized statements.
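To make the anecdote concrete, here's a minimal sketch of the kind of datalogger an LLM might hand back for that task. The python-can library, the socketcan channel, and the CSV format are my assumptions for illustration, not details from the actual project:

```python
# Hypothetical sketch only: log CAN frames to CSV using python-can.
# Channel name, interface type, and output format are assumptions.
import csv

import can


def log_can_frames(outfile: str = "canbus_log.csv", channel: str = "can0") -> None:
    bus = can.interface.Bus(channel=channel, bustype="socketcan")
    with open(outfile, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp", "arbitration_id", "data_hex"])
        try:
            while True:
                msg = bus.recv(timeout=1.0)  # None if no frame arrives in time
                if msg is not None:
                    writer.writerow(
                        [msg.timestamp, hex(msg.arbitration_id), msg.data.hex()]
                    )
        except KeyboardInterrupt:
            pass
        finally:
            bus.shutdown()


if __name__ == "__main__":
    log_can_frames()
```

Whether code like this is "good enough" depends entirely on what you need from it, which is the point of the next sections.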
As this scales and gets adopted, though, the majority of the success stories I hear are from high-expertise folks who say "Well, now my job becomes code review, which is great - I can think in the generalized abstractions that I want to have, instead of having to spend a bunch of time writing code I don't really want to care about". This is great if you have the pre-existing context to define an appropriate Accepted Scope.
Accepted Scope
Accepted Scope is how the developer balances the explicit requirements versus the implicit requirements of a given piece of development work. Explicit requirements are "this code must move data from A to B"; implicit requirements are "the code meets security, style, reusability, reliability, or other characteristics beyond the movement from A to B". Having the data move? Explicit requirement. Having it logged in a usable and meaningful fashion? Implicit requirement.
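As a toy illustration (the function names and logging choices here are mine, not from any particular codebase), both versions below satisfy the explicit requirement of moving data from A to B; only the second makes any attempt at the implicit ones:

```python
import logging
import shutil

logger = logging.getLogger("data_mover")


# Explicit requirement only: the data gets from A to B.
def move_data_naive(src: str, dst: str) -> None:
    shutil.copy(src, dst)


# Explicit requirement plus a couple of implicit ones: failures are surfaced
# rather than swallowed, and the operation is logged in a usable, meaningful way.
def move_data(src: str, dst: str) -> None:
    try:
        shutil.copy(src, dst)
    except OSError:
        logger.exception("failed to move %s -> %s", src, dst)
        raise
    logger.info("moved %s -> %s", src, dst)
```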
In the best environments, there is a high-quality dialog between the people who are accountable for prioritization and the developers to balance the accepted scope of each project they're on. They might do some fast work to hit a particular milestone, and then refocus in the next round on addressing the implicit requirements they skipped, to avoid downstream technical debt leading to feature / product slowdown, etc.
In the most ideal world, they're looking at the critical implicit requirements and building frameworks, infrastructure, or tooling to make those requirements easy to meet without a lot of extra energy, creating systems where the secure / appropriately logged / well structured way to do things is the easy way.
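A minimal sketch of what "making the good way the easy way" can look like; the decorator name and logger here are hypothetical, standing in for whatever shared platform code a team actually provides:

```python
import functools
import logging

logger = logging.getLogger("platform")


# Hypothetical shared helper: any task wrapped in it gets consistent logging
# and error reporting for free, so the well-instrumented path is also the lazy path.
def paved_path(operation: str):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            logger.info("starting %s", operation)
            try:
                result = fn(*args, **kwargs)
            except Exception:
                logger.exception("%s failed", operation)
                raise
            logger.info("finished %s", operation)
            return result
        return wrapper
    return decorator


@paved_path("sync-user-data")
def sync_user_data() -> None:
    # The developer only writes the business logic; the implicit
    # requirements around logging and error reporting are met by the wrapper.
    pass
```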
Side note: This was one of the best parts about Google's infrastructure - when you used the Google tooling, you got the advantage of a bunch of patterns that were safe / secure / made sure you did hygiene properly, as well as a number of other controls in place. It had the downside of often not explaining why you couldn't do the thing that felt obvious, and that problem, ironically, could probably have been usefully solved by LLMs.
Generative Code & Accepted Scope
When you use LLMs to generate code, the implicit requirements stop being defined by your company culture and the previous experience of your existing devs (a decision that can be introspected, trained, and changed) and become the average of the implicit requirements of the training dataset. A good developer will note this and correct it when reviewing code, based on their experience with the codebase they're working in.
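For a sense of what that correction looks like in review, here's a contrived before/after. The "generated" version reflects patterns that are common in public code; the "reviewed" version reflects conventions a team might actually hold (parameterized queries, the shared logger, no silent failure). All names are illustrative, and the query placeholder style assumes a sqlite3-like DB-API driver:

```python
import logging

logger = logging.getLogger("users")


# Roughly the "training-set average": it works, but it builds SQL by string
# formatting, swallows every error, and prints instead of using the team logger.
def fetch_user_generated(conn, user_id):
    try:
        cur = conn.cursor()
        cur.execute(f"SELECT * FROM users WHERE id = {user_id}")
        return cur.fetchone()
    except Exception as e:
        print("error:", e)
        return None


# What a reviewer with codebase context would push it toward.
def fetch_user_reviewed(conn, user_id: int):
    cur = conn.cursor()
    cur.execute("SELECT * FROM users WHERE id = ?", (user_id,))
    row = cur.fetchone()
    if row is None:
        logger.info("no user found for id=%s", user_id)
    return row
```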
There are two major concerns here. The first is that developers aren't generally trained on this as a skillset - the assumptions for reviewing human code and machine code are very different. The second is that doing this, and doing it well, is going to slow down the delivery of the code, even as it improves quality.
But more broadly, your juniors will work faster precisely because they lack context on implicit requirements, pushing progressively more risk into your organization, faster. Critically, this also pushes more work onto your senior staff, because they're the only ones with the context to perform reviews at that level of depth. Additionally, it's extremely hard to predict which implicit requirements any specific piece of generated code will or won't meet, because the generative system has no ability to understand what code quality means in your specific environment.
The Outcomes
This becomes a business-scale problem: You have a code review pipeline that is designed to handle human code review problems at a certain scale, and you're pushing into that system a new coding paradigm with a completely different set of expectations and failure states. You have also changed the scale of the problem: Your developers will now move faster the less they care about implicit requirements. Senior developers might care about this problem, but junior developers won't even be able to contextualize it without significant mentorship and support from senior folks to explain why one piece of code that solves the problem is better than another piece that also solves it.
Massively increasing code review load and scope simultaneously is a good way to increase the number of bugs in code.
Some Common Arguments / Rebuttals
"Just prompt the implicit requirements into the models!"
- It's not possible to define the implicit requirements reliably, not only because "define secure code" is an exercise in shifting context, but also because you can and should vary your implicit requirements.
- This also assumes you can make the models generate code consistently and reliably in the same fashion based on a given prompt, which is contrary to the design of a system that is supposed to generate novel code.
"Humans make mistakes too!"
- A human who makes the same mistake is likely to make it in a reasonably consistent way, so you can add a technical check, linting, or automated scanning for that specific problem (see the sketch below). An AI is likely to make the mistake in a novel way each time, because that's how these systems work, so it's much harder to define the behavior patterns on either the positive or negative side - you have to account for the sum totality of the ways humans write code, rather than the specific contexts of your developers.
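As an example of what codifying one of those consistent human mistakes can look like, here's a small stand-alone check; the shell=True rule is just a stand-in for whatever recurring mistake a specific team has actually observed:

```python
import ast
import sys


# Illustrative CI check: flag calls that pass shell=True, a mistake this
# hypothetical team has seen the same humans make the same way repeatedly.
def find_shell_true(path: str) -> list[int]:
    with open(path) as f:
        tree = ast.parse(f.read(), filename=path)
    hits = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            for kw in node.keywords:
                if (
                    kw.arg == "shell"
                    and isinstance(kw.value, ast.Constant)
                    and kw.value.value is True
                ):
                    hits.append(node.lineno)
    return hits


if __name__ == "__main__":
    failed = False
    for path in sys.argv[1:]:
        for lineno in find_shell_true(path):
            print(f"{path}:{lineno}: shell=True is disallowed in this codebase")
            failed = True
    sys.exit(1 if failed else 0)
```

A check like this is cheap because the failure mode is stable; when the failure mode is different every time, you're back to relying on a human reviewer to catch it.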