The Data Says You're Not Questioning AI-Generated Code Enough


BackerLeader · 3 min read

Anthropic just published research that should make every developer uncomfortable. When AI produces code, documents, or working apps, users stop checking the work. They get more precise about what they ask for up front — but less rigorous about evaluating what comes back.

The company analyzed 9,830 Claude conversations from January 2026 and built what it calls the AI Fluency Index: a baseline measurement of 11 observable behaviors that indicate how effectively people collaborate with AI. The findings are practical and specific. And if you write code with AI — which, at this point, most of us do — the data points directly at habits worth changing.

What They Measured

Anthropic worked with Professors Rick Dakan and Joseph Feller to develop a framework of 24 behaviors representing effective human-AI collaboration. Eleven are directly observable in chat conversations — including iteration, clarifying goals, specifying formats, questioning reasoning, identifying missing context, and fact-checking. Each conversation was scored for the presence or absence of each behavior.
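
Presence/absence scoring of this kind is easy to picture as a tally over conversations. A minimal sketch of that tallying, using behavior labels from the report but entirely invented sample data:

```python
# Toy sketch of presence/absence behavior scoring.
# Behavior labels mirror the report; the conversation data is invented.

BEHAVIORS = [
    "iteration", "clarifying_goals", "specifying_formats",
    "questioning_reasoning", "identifying_missing_context", "fact_checking",
]

# Each conversation is reduced to the set of behaviors observed in it.
conversations = [
    {"iteration", "clarifying_goals", "questioning_reasoning"},
    {"specifying_formats"},
    {"iteration", "fact_checking", "identifying_missing_context"},
]

def behavior_rates(convos):
    """Share of conversations in which each behavior appears."""
    n = len(convos)
    return {b: sum(b in c for c in convos) / n for b in BEHAVIORS}

rates = behavior_rates(conversations)
print(f"iteration rate: {rates['iteration']:.2f}")  # 2 of 3 conversations
```

The real analysis covered 9,830 conversations and 11 behaviors, but the measurement itself is this simple: each behavior either shows up in a conversation or it doesn't.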

The Iteration Effect

The strongest signal in the data: iteration predicts everything else.

85.7% of conversations showed iteration and refinement — users building on previous responses rather than accepting the first answer. These conversations exhibited 2.67 additional fluency behaviors on average, roughly double the 1.33 in non-iterative conversations.

The effect is most pronounced for critical evaluation. Iterative conversations were 5.6 times more likely to include users questioning the AI's reasoning and 4 times more likely to surface missing context.
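
Statistics like "2.67 vs. 1.33 additional behaviors" come from splitting conversations on whether iteration occurred and averaging the count of other behaviors in each group. A toy sketch of that split (the data is invented; only the method is illustrated):

```python
# Toy sketch: mean count of other fluency behaviors,
# split by whether the conversation showed iteration.
# Sample data is invented for illustration.

conversations = [
    {"iteration", "clarifying_goals", "questioning_reasoning"},
    {"iteration", "fact_checking"},
    {"specifying_formats"},
    {"clarifying_goals"},
]

def mean_other_behaviors(convos, iterative: bool) -> float:
    """Average number of non-iteration behaviors in the chosen group."""
    group = [c for c in convos if ("iteration" in c) == iterative]
    counts = [len(c - {"iteration"}) for c in group]
    return sum(counts) / len(counts)

print(f"iterative:     {mean_other_behaviors(conversations, True):.2f}")
print(f"non-iterative: {mean_other_behaviors(conversations, False):.2f}")
```

Even in this tiny invented sample, the iterative group averages more co-occurring behaviors, which is the shape of the finding Anthropic reports at scale.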

This quantifies something experienced engineers already sense. The first response from an AI coding tool is a draft. The developers who get the most value are the ones who push back, ask follow-up questions, and refine through multiple rounds.

The Polished Output Trap

Here's where the data gets uncomfortable. 12.3% of conversations involved artifacts — code, documents, interactive tools, and other concrete outputs. In these conversations, users were significantly more directive up front. They clarified goals (+14.7 percentage points compared to non-artifact conversations), specified formats (+14.5pp), and provided examples (+13.4pp).

But that care didn't carry through to evaluation. Users were less likely to identify missing context (-5.2pp), check facts (-3.7pp), or question the reasoning behind the output (-3.1pp).

Anthropic calls this "more directive but less evaluative." For developers: when Claude hands you a working component or a clean function, your instinct is to ship it. The output looks finished, so you treat it as finished. But Anthropic's own Economic Index found that the most complex tasks are where AI struggles most.

Developers might be evaluating artifacts outside the conversation — running tests, doing code review. Anthropic can't observe that. But a 5.2 percentage point drop in identifying missing context is worth paying attention to.

Three Habits the Data Supports

Anthropic's report identifies three practical behaviors where most users have room to improve. All three are directly relevant to developers.

Stay in the conversation. Don't accept the first response. Ask follow-up questions. Push back on parts that don't look right. The data is clear: iteration is the strongest predictor of all other fluency behaviors. Treat every AI response as a draft you're reviewing, not an answer you're accepting.

Question polished outputs specifically. When AI gives you code that runs, that's when to pause and ask: Is this handling edge cases? What assumptions is it making? What's missing? The research shows this is exactly the moment people stop asking those questions. Build the habit of treating functional output as the starting point for review, not the endpoint.

Set the terms of the collaboration. Only 30% of conversations included users telling the AI how they wanted it to interact with them. Add instructions like: "Push back if my assumptions are wrong." "Walk me through your reasoning before giving me the answer." "Tell me what you're uncertain about." These meta-instructions change the dynamic of the entire conversation. They turn a passive code generator into an active collaborator that surfaces its own limitations.
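
In practice, these meta-instructions belong in a system prompt or project-level instructions rather than being retyped each session. A minimal sketch of assembling one (the role description and helper name are illustrative, and the wording is taken from the examples above):

```python
# Sketch: baking collaboration terms into a reusable system prompt.
# The role string and function name are illustrative choices.

COLLABORATION_TERMS = [
    "Push back if my assumptions are wrong.",
    "Walk me through your reasoning before giving me the answer.",
    "Tell me what you're uncertain about.",
]

def build_system_prompt(role: str = "senior engineering collaborator") -> str:
    """Combine a role description with explicit meta-instructions."""
    terms = "\n".join(f"- {t}" for t in COLLABORATION_TERMS)
    return f"You are a {role}.\n\nHow to interact with me:\n{terms}"

print(build_system_prompt())
```

Pass the resulting string as the system prompt in whatever AI tool or API you use; the point is that the terms are set once, up front, for every conversation.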

What This Means Going Forward

Anthropic is treating this as a baseline and plans cohort analyses comparing new users with experienced ones, plus studies exploring whether encouraging iteration actually leads to better evaluation.

For developers, the takeaway is concrete. The gap between using AI tools and using them well isn't about prompt engineering tricks. It's about collaboration habits. Iterate. Question the output that looks most finished. Tell the AI what kind of collaborator you want it to be.

The AI Fluency Index gives us a number for what the best developers already practice: the first response is never the final answer.
