The Quality Author: Taste as the Last Bottleneck in AI Development

On where craftsmanship went, why verification gaps appear in its absence, and the one practice AI cannot automate for you.


From Writing Code to Judging Code

There is a sentence that has been circulating through developer circles since 2026, attributed to Steve Yegge and repeated in discussions about AI-assisted software development.

People repeat it with the particular enthusiasm reserved for things that make them feel slightly less implicated in what they have already been doing for months (1):

“Code is a liquid. You spray it through hoses. You don’t freaking look at it.”

The wording matters less than the reaction it provoked. It was not really a reaction at all. It was recognition, the slightly unpleasant kind that surfaces when a sentence names something you had been doing without admitting it.

A request goes into a prompt. Code comes back, sometimes a surprising amount of code.
You skim it. Maybe you run it. Maybe you do not.

The tests pass, or enough of them pass that nobody wants to be the person delaying the merge. CI turns green. The branch disappears into main.

Weeks later production starts behaving strangely and somebody opens a file that nobody has examined carefully since the day it was generated. Then the unpleasant part begins. Not the bug itself. The archaeology.

You start reading through functions that seem reasonable in isolation. Every individual decision appears defensible. Variable names are clean, comments are present, error handling exists. Yet after an hour you still cannot answer a basic question:

  • Why does this subsystem exist in this shape?

  • Why is this abstraction boundary here instead of somewhere else?

  • Why is retry logic duplicated in three places with slightly different behavior?

  • Why does the architecture feel assembled rather than designed?

Nobody knows with confidence because the code entered the repository through a sequence of prompts, edits, regenerations, partial rewrites, and hurried approvals. It works, mostly. Understanding never fully caught up.

That is the shift many people are describing without naming directly. Most conversations about AI-assisted development focus on generation because generation is visible. Screenshots are visible, benchmarks are visible, revenue charts are visible. Understanding leaves almost no artifact behind except the ability to explain a system when it starts failing under conditions nobody anticipated.

Andrej Karpathy’s AI Ascent 2026 remarks landed in the middle of this transition. (2) The headline many people remember is that he moved from writing most of his code himself to having agents produce most of it within a remarkably short period.

The more interesting observation came later: he argued that LLMs automate what can be verified while human judgment remains necessary for deciding what is worth building in the first place. People nodded.

The problem is that obvious statements often conceal the thing that matters.

Karpathy’s observation carries a fairly brutal implication. For decades, software organizations behaved as though production was the scarce resource, structuring hiring plans and management rituals and release processes around increasing output. Then the cost of producing code dropped dramatically. Not to zero, but enough to expose something awkward.

The bottleneck did not disappear. It moved. Suddenly there was an abundance of artifacts and a shortage of people willing to interrogate them deeply enough to know whether they deserved to exist. That shortage is harder to measure, which is exactly why it matters.


The Speed Feels Like Quality

The acceleration was real and pretending otherwise is pointless. By the mid-2020s, AI coding assistants had become routine across professional development workflows, with platform reporting and industry surveys showing broad adoption among working developers. (3)

Industry discussions increasingly treated AI-generated code as a normal part of development rather than an exception. Products like Lovable, Bolt, and Replit made it possible for people with little or no programming background to assemble functioning applications through conversation.

Lovable alone reportedly crossed $100M ARR while reporting more than 100,000 projects generated per day. (4) Reports of extraordinary product velocity became common enough that people stopped questioning the timelines.

Then maintenance showed up. It always does.

This time it arrived disguised as something small and irritating: a permissions bug nobody could reproduce consistently, a dependency conflict introduced months earlier by a generated patch nobody remembers reviewing, or a memory leak that only appeared under traffic patterns absent from staging.

You start investigating what should have been a twenty-minute issue. Six hours later you are staring at a call chain looping through generated abstractions layered on top of generated abstractions. Every layer appears sensible. The whole thing feels wrong in a way you cannot immediately articulate.

Engineers describe these systems as ownerless, and that word captures something important. Traditional codebases can be ugly, chaotic, overengineered, underengineered, or all of the above, but they often contain evidence of struggle.

You can see where somebody fought with a problem, you can see compromises, you can see scars. AI-generated systems frequently arrive polished in a way that feels strangely detached from the decisions embedded inside them. The surface is smooth. The reasoning is harder to find.

Research has been pointing in a similar direction. A large-scale study covering more than half a million Python and Java samples found that AI-generated code tended toward repetition and simplicity. It also exhibited maintainability and security concerns at higher rates. (5)

Stanford-affiliated researchers found something even more unsettling: developers using AI assistance often produced less secure code. They simultaneously expressed greater confidence in its security. (6) The exact percentages matter less here than the pattern. The code looked convincing enough that inspection felt optional.

That feeling is where the cost begins. It is also where speed stops being a workflow advantage and starts becoming a philosophy of production.


The Technique Problem

Alex Wennerberg reached for Jacques Ellul’s idea of technique to explain why all of this felt familiar. (7) Ellul is difficult to summarize because the argument resists compression into the kind of slogan modern technology culture prefers.

Roughly, technique describes what happens when efficiency stops functioning as a tool and quietly becomes the objective itself. The metric survives; the thing being measured slowly disappears behind it.

Wennerberg applied the idea to music first, noting that streaming platforms don’t optimize for artistic depth but for measurable behavior: engagement, retention, listening time. Music enters one side of the machine and exits the other as a collection of signals. AI-generated music fits neatly into that environment because the surrounding system already rewards endless quantities of acceptable output.

Software has drifted into similar territory. It is difficult to ignore how much of the industry’s language revolves around throughput: velocity, deployment frequency, time-to-market, story points.

None of these measures are inherently misguided. The problem begins when they become proxies for quality and then quietly replace quality altogether. AI looks miraculous inside a system optimized primarily for output because output is exactly what it produces in abundance.

The surprise comes later, usually after enough generated software accumulates that somebody has to maintain it, and organizations discover that code was never the scarce resource. Understanding was.

Wennerberg’s conclusion remains one of the better summaries:

“We can continue to do not very good software much more quickly and effectively with AI.”

But AI cannot solve the main systemic problem in the software industry, which is that we still haven’t figured out how to build software well at scale.

Doing that requires a sense of craft and real human critical thought. What interests me is that a similar conclusion emerged from people who care about entirely different things.

Bessemer Venture Partners approached the issue through competitive advantage rather than craftsmanship and still ended up circling around judgment and taste. (8)

That convergence is difficult to dismiss, because it suggests the same thing from two directions: when production becomes cheap, craft does not vanish. It relocates.


Craftsmanship Didn’t Disappear. It Moved.

If production is no longer the only place where human effort appears, craft has to be looked for somewhere else.

Craftsmanship did not disappear when AI started writing code. It moved from production into acceptance.

That acceptance layer is where taste becomes visible. Not taste as aesthetics, preference, or polish, but taste as the disciplined refusal to accept plausible output until it has survived inspection.

I learned this most clearly from a failure that did not look like a failure at first.

In our work on AI-SLOP-Detector, I was investigating a scoring system that appeared healthy. (9) The outputs looked plausible, the tests passed, metrics moved in expected directions, nothing was visibly broken. If you had shown me the dashboard without context I probably would have approved it and moved on.

Instead I spent several nights digging through it because something felt off. That sentence sounds mystical written down, but the experience wasn’t. There was no magical intuition involved. The discomfort came from tiny inconsistencies that refused to disappear: scores clustering too neatly, certain edge cases behaving suspiciously well.

Every time I convinced myself the system was fine, another detail surfaced that made the explanation less satisfying.

The debugging process was ugly in the way debugging always is but rarely gets documented. There was no elegant detective moment.

Mostly, it meant opening files, tracing data paths, dumping intermediate outputs, and discovering that previous assumptions were wrong. It meant rewriting scripts, rerunning experiments, and repeatedly ending up back where I started.

I became convinced the issue was buried in preprocessing. It wasn’t. Then I blamed the evaluation pipeline. Wrong again. Hours vanished into dead ends until eventually the explanation emerged: the model had effectively learned to reproduce the scoring formula used to generate its labels.

It looked intelligent because it was mirroring the structure of the training process. The postmortem summarized this with a sentence that was funny only because it was painfully accurate: we trained a calculator.

The lesson was not that the system failed, because everything fails. The lesson was how long the failure remained invisible because the outputs looked convincing enough to discourage deeper inspection. That distinction matters more now than it used to. The labor increasingly happens after generation.


What “Taste” Actually Means

That is why the word taste needs to be handled carefully. It gets thrown around constantly in AI discussions and usually means almost nothing. It functions as a flattering placeholder for qualities people cannot define. Somebody produces good work, somebody else calls it taste, conversation ends. That definition is useless.

The version that matters in engineering has very little to do with aesthetics and a great deal to do with irritation. Specifically: the inability to stop investigating after everyone else has become satisfied. I keep thinking about a GitHub discussion titled “Every Claim, Verified Against Source Code” (10), which was almost aggressively boring as an exercise.

Public claims checked against implementation details, files opened, line numbers traced, assertions treated as things requiring evidence rather than things requiring agreement. A table with columns for claim, verdict, source file, line range. Nothing about that table is exciting, and that is precisely the point.
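A table of that shape is simple enough to express as plain data plus a mechanical check. The sketch below is illustrative only: the `Claim` fields, the `verify` function, and the example `scorer.py` file are hypothetical and are not taken from the cited discussion. It checks whether a piece of evidence appears within the cited line range of the cited file.

```python
from dataclasses import dataclass
from pathlib import Path
import tempfile


@dataclass
class Claim:
    text: str          # the public claim being checked
    source_file: str   # file said to implement it
    line_start: int    # 1-indexed, inclusive
    line_end: int      # 1-indexed, inclusive
    evidence: str      # snippet expected inside that line range


def verify(claim, root):
    """Return 'verified', 'not found', or 'missing file' for one claim."""
    path = Path(root) / claim.source_file
    if not path.is_file():
        return "missing file"
    lines = path.read_text(encoding="utf-8").splitlines()
    cited = "\n".join(lines[claim.line_start - 1 : claim.line_end])
    return "verified" if claim.evidence in cited else "not found"


with tempfile.TemporaryDirectory() as root:
    # A tiny stand-in source file so the example is self-contained.
    Path(root, "scorer.py").write_text(
        "def score(x):\n    # clamp to [0, 1]\n    return max(0.0, min(1.0, x))\n",
        encoding="utf-8",
    )
    claims = [
        Claim("Scores are clamped to [0, 1]", "scorer.py", 2, 3, "min(1.0, x)"),
        Claim("Scores are cached", "scorer.py", 1, 3, "lru_cache"),
        Claim("Config is validated", "config.py", 1, 10, "validate"),
    ]
    verdicts = [verify(c, root) for c in claims]
    for c, verdict in zip(claims, verdicts):
        print(f"{c.text} | {verdict} | {c.source_file}:{c.line_start}-{c.line_end}")
```

A substring match is a deliberately crude notion of evidence. The value lies less in the check itself than in forcing every claim to name a file and a line range that somebody can open.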

Verification is usually tedious. It involves reading things you hoped you would not need to read: documentation that is wrong, comments that are outdated, assumptions that are unsupported, previous confidence that was misplaced.

People claim verification is important and then structure their workflows around avoiding it because verification is where comforting narratives go to die. Taste, at least in the form that survives contact with real engineering work, is the habit of continuing anyway.

Not because you enjoy it, but because experience has taught you that plausibility is cheap.


The Verification Gap Is the Bottleneck

Whenever concerns about AI quality surface, somebody proposes a tooling solution: better scanners, better CI, better automated review systems. None of those are bad ideas, and most are useful, but the problem is that they arrive too late in the chain.

Before any tool runs, somebody has to decide that scrutiny is necessary. That sounds trivial until you remember what software development actually feels like under pressure.

Deadlines move closer, stakeholders want updates, the rollout appears stable, metrics look healthy, you are tired, and the temptation to stop asking questions becomes overwhelming.

I remember lowering a deployment threshold because everything appeared fine. The tests passed, monitoring looked normal, nothing suggested immediate danger.

Months later, while investigating something completely unrelated, I discovered that the threshold had been quietly shielding the system from edge cases I had never examined properly.

There was no catastrophic outage, no dramatic lesson, just the slow realization that I had mistaken the absence of visible failure for evidence of understanding. Those are not the same thing, and a lot of AI-assisted development feels trapped inside that confusion.

That is the point where this stops being a tooling problem and becomes a judgment problem.

Karpathy’s observation keeps resurfacing because it points directly at the problem: models can automate what can be verified, but they cannot decide what deserves verification.

They cannot determine whether a benchmark measures something meaningful, and they cannot inherit the maintenance burden waiting six months down the road when assumptions collide with reality. (2)

Humans inherit that burden whether they want to or not. Research keeps finding variations of the same pattern. A 2025 study examining experienced open-source developers found that participants expected AI assistance to save substantial amounts of time.

In practice, debugging and correction often consumed much of the anticipated gain. (11) That result makes sense to anyone who has spent a night chasing a production issue through generated code.

Real debugging rarely resembles the clean narratives people tell afterward. Logs contradict each other, instrumentation turns out to be incomplete, the bug migrates every time you think you have isolated it, you patch one issue and expose another.

By three in the morning you are questioning assumptions that seemed obvious at midnight. By sunrise the original problem has become tangled with several others and you are no longer entirely sure which one started the investigation.

Verification is difficult because reality resists simplification.


The Deeper Asymmetry

People increasingly describe developers as orchestrators of intelligent systems. Maybe that description is accurate. What bothers me is how easily the metaphor creates distance between builders and consequences. Eventually somebody still has to read the logs.

Still has to explain why production diverged from testing. Still has to figure out why memory consumption exploded after traffic crossed a threshold nobody modeled correctly.

The details never disappear. They wait.

This is why craftsmanship remains a more useful concept than productivity. Productivity measures output. Maintenance measures responsibility. The second category has a habit of reasserting itself no matter how sophisticated the tooling becomes. Wennerberg quotes Jonathan Blow’s complaint that the industry has forgotten how to do things. (7)

That observation lands differently now. AI can conceal the forgetting process for surprisingly long periods. A developer can remain productive while understanding less and less of the machinery moving beneath the surface. Demos succeed, launches succeed, quarterly reports look healthy. Then maintenance arrives and starts demanding explanations. Explanations are harder to generate than code.

The developers who remain valuable in that environment will probably leave behind evidence that they inspected things. Not polished declarations about quality. Evidence: audit trails, rejected implementations, strange tests written specifically because somebody distrusted an assumption. Notes documenting why a seemingly reasonable solution was discarded after deeper investigation revealed a flaw.

Those artifacts matter because they reveal proximity. Somebody stayed close enough to the system to notice where reality diverged from appearances.

Perhaps that is where authorship survives. Not in generation, because generation is abundant now. In acceptance. The moment you decide a generated artifact is good enough to carry your responsibility, it stops being the model’s problem and becomes yours.

Everything unpleasant follows from that decision: the doubt, the investigation, the corrections. The nights spent tracing behavior through systems that looked perfectly reasonable when they were first merged.

Scarcity has moved. The code is easier. Living with it is not.


References

  1. Tim O’Reilly, Software Craftsmanship in the Age of AI, O’Reilly Radar, 2026. Includes discussion of remarks attributed to Steve Yegge.
  2. Andrej Karpathy, Sequoia AI Ascent 2026, karpathy.bearblog.dev, 2026. Widely summarized through event notes and secondary reporting.
  3. General adoption trend based on publicly available platform reporting and industry surveys from the mid-2020s. No single source cited; GitHub Octoverse and Stack Overflow Developer Survey editions from this period are representative. See GitHub Blog.
  4. Lovable hits $100M ARR, Sifted, 2025.
  5. Domenico Cotroneo et al., Human-Written vs. AI-Generated Code: A Large-Scale Study of Defects, Vulnerabilities, and Complexity, arXiv:2508.21634, 2025.
  6. Neil Perry et al., Do Users Write More Insecure Code with AI Assistants?, arXiv:2211.03622v3, CCS 2023.
  7. Alex Wennerberg, AI Code and Software Craft, alexwennerberg.com, 2026.
  8. Lindsey Li et al., Developer Laws in the AI Era, Bessemer Venture Partners Atlas, 2025.
  9. flamehaven01, The Tool That Turned on Itself, dev.to, 2026.
  10. flamehaven01, AI-SLOP-Detector v3.50: Every Claim, Verified Against Source Code, dev.to, 2026. Used as an illustrative example of verification practice rather than independent evidence.
  11. Joel Becker et al., Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity, arXiv:2507.09089, 2025.