Most teams are getting 10 to 30% of what their LLM can actually do. Not because the model is weak. Because the prompt is.
I’ve spent the last two weeks scoring prompts. Real ones, from real builders, across real verticals, against an 8-dimension quality rubric. This weekend I ran another 500 through the scorer to pressure-test the pattern. Every dataset converges on the same number: the average production prompt scores 13 to 16 out of 80. That’s 16 to 20% of what the rubric says a well-formed prompt looks like.
You’re paying for a Ferrari and driving it in first gear to the mailbox.
What I Measured
Every prompt got scored on 8 dimensions. Each scored 1 to 10, totaling 80:
- Clarity. Is the task unambiguous?
- Specificity. Concrete targets, numbers, scope?
- Context. Background, assumptions, domain?
- Constraints. Limits, rules, edge cases?
- Output format. What shape should the response take?
- Role definition. “Act as a ___”?
- Examples. Few-shot or reference cases?
- Chain-of-thought structure. Reasoning scaffolding?
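As a sketch, the rubric is just eight 1-to-10 scores summed into an 80-point total. The dimension names below follow the article; the schema and helper are my assumptions, not the scorer's actual implementation:

```python
# Illustrative sketch of the 8-dimension rubric. Dimension names follow
# the article; this schema is an assumption, not the scorer's code.
DIMENSIONS = (
    "clarity", "specificity", "context", "constraints",
    "output_format", "role_definition", "examples", "chain_of_thought",
)

def total_score(scores: dict) -> int:
    """Sum per-dimension scores (each 1 to 10) into the 80-point total."""
    missing = set(DIMENSIONS) - set(scores)
    if missing:
        raise ValueError(f"unscored dimensions: {missing}")
    if not all(1 <= scores[d] <= 10 for d in DIMENSIONS):
        raise ValueError("each dimension scores 1 to 10")
    return sum(scores[d] for d in DIMENSIONS)

# A typical prompt from the dataset: weak everywhere except clarity.
typical = {d: 1 for d in DIMENSIONS}
typical.update(clarity=3, specificity=2, context=2, output_format=2)
print(total_score(typical))  # 13 out of 80
```

The point of making this explicit: a prompt that reads fine as an English sentence can still bottom out on five of the eight dimensions.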
These aren’t arbitrary. They map to what the prompt engineering literature has known for five years. PEEM, RAGAS, G-Eval, MT-Bench, the Anthropic and OpenAI prompting guides. Everyone agrees these dimensions matter. Nobody’s checking whether their production prompts actually hit them.
The Data

500 software engineering prompts, real-world submissions in the format “Build X using Y.”
Average score: 13.3 out of 80
83% graded F. 17% graded D. Zero scored C or above.
After rewriting against the rubric: average 68.5 out of 80. A B+.
Average improvement: +55 points, a roughly 415% relative gain.
For context, the organic dataset, 248 prompts submitted by real users of the scoring tool across 7 verticals, showed the same pattern: 89% graded D or F, with an average before-score of 15.8/80.
Software prompts were slightly worse than average. Not better. Engineers aren’t exempt from this. If anything, the “it’s a technical task so it must be rigorous” assumption is the trap.
What’s Actually Missing
Here’s the dimension breakdown. Look at how specific the failure pattern is:
Examples scored 1.01 out of 10. Across 500 prompts that developers wrote to build production software, essentially zero included a reference case, a shape to follow, or a “here’s what good looks like.”
This is the dimension every prompt engineering guide tells you matters most. The gap between what engineers know they should do and what they actually write is near-total.
Constraints at 1.09. Role definition at 1.18. Clarity, the only dimension averaging above 2, sits at 3.19.
Engineers are writing English sentences with tech keywords. The structural scaffolding that turns a wish into a spec is almost entirely absent.
What This Looks Like In Practice
A representative prompt from the dataset:
“Build a real-time collaborative text editor using React for the frontend.”
Scores 14/80.
It sounds specific. It names a technology. It has a verb. But the model receiving it has to guess. Collaboration for how many users? What’s the sync strategy? Operational transform or CRDT? What’s the latency budget? What does “done” look like? No examples. No output format. No constraints.
The rewritten version, same task, generated by the scorer to address the rubric gaps, scored 69/80. It defined the target user. It specified real-time sync requirements. It listed technical constraints including concurrent editors, conflict resolution strategy, and latency targets. It specified the response format. It included an example implementation signature.
Same end goal. Different input. 5x the quality score. The before version wastes tokens on clarification and produces generic output. The after version reads more like a spec than a prompt.
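To make the before/after concrete, here is a hypothetical reconstruction of that rewrite's shape, with each rubric dimension as an explicit section. All specifics (user counts, latency numbers, the hook signature) are invented for illustration; this is not the scorer's actual output:

```python
# Each rubric dimension becomes an explicit prompt section. All the
# specifics below are invented to show the shape, not the scorer's
# actual rewrite of this prompt.
sections = {
    "Role": "Act as a senior frontend engineer experienced with "
            "real-time collaboration systems.",
    "Task": "Build a real-time collaborative text editor in React.",
    "Context": "Target users are small writing teams (2 to 10 "
               "concurrent editors) on a shared document.",
    "Constraints": "Use CRDTs for conflict resolution; keep perceived "
                   "sync latency under 200 ms; no third-party editor SDKs.",
    "Output format": "An architecture outline first, then the core "
                     "React components with commented code.",
    "Example": "Sketch the sync hook signature, e.g. "
               "useCollabDoc(docId) -> {doc, applyOp, presence}.",
}
prompt = "\n\n".join(f"## {name}\n{body}" for name, body in sections.items())
print(prompt)
```

Nothing here is exotic. The rewrite is just the vague sentence expanded until every rubric dimension has an answer.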
Why This Matters Now
The easy dismissal is “people should write better prompts.” That misses the systemic problem.
The industry has spent years focused on output evals. Every eval platform measures what the model produced. Almost nobody measures what the model was given.
That worked when prompts were single-shot, human-written, and reviewed before shipping. It stops working the moment prompts become infrastructure.
In agentic workflows where one LLM call feeds the next, a 13/80 input becomes the input to the next call, which is already compromised before you add retrieval or structured tool calls. In the last week alone, three x402-native agent systems went live that share the same input surface: natural language.
- Daydreams Taskmarket. Agents bidding on work described in plain text.
- PeptAI. Autonomous peptide discovery running wet-lab orders.
- AlliGo. A credit bureau scoring agent behavior across endpoints.
If the descriptions populating those systems score like the 500 in this dataset, the agent economy is routing compute and payments on structurally empty inputs.
The output eval loop can’t catch this. By the time the output looks wrong, the compute bill is already on your card.
An Infrastructure Problem
Telling engineers to write better prompts is like telling them to write better SQL without giving them a linter. Telling teams to review prompts manually is like asking them to do code review without git blame.
The answer is the same answer every other quality-assurance problem eventually reached. Measure it. Instrument it. Put the measurement in the continuous integration pipeline. Block the bad ones before they ship.
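A CI gate over prompt scores can be a few lines. This is a minimal sketch: the threshold is an assumed example, the scores are hardcoded, and in a real pipeline they would come from a scoring tool rather than this stand-in:

```python
# Gate threshold is a team choice; 48/80 is an assumed example, not a
# default from any particular tool.
THRESHOLD = 48

def gate(scored_prompts: dict) -> int:
    """Return a CI exit code: nonzero if any prompt falls below THRESHOLD.

    `scored_prompts` maps prompt paths to rubric totals. In a real
    pipeline these totals would come from a scoring tool; here they
    are hardcoded for illustration.
    """
    failed = {p: s for p, s in scored_prompts.items() if s < THRESHOLD}
    for path, score in sorted(failed.items()):
        print(f"FAIL {path}: {score}/80 (gate: {THRESHOLD})")
    return 1 if failed else 0

# Example run: one failing prompt, one passing prompt.
exit_code = gate({"prompts/editor.txt": 14, "prompts/search.txt": 62})
print("exit code:", exit_code)  # a nonzero code fails the CI job
```

The design point is that the gate is dumb on purpose: all the judgment lives in the scorer, and CI only enforces a floor, exactly like a linter.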
The Solution
It’s already here at https://pqs.onchainintel.net.
1. Free tier: paste a prompt, get the 8-dimension breakdown plus a suggested rewrite. No signup required.
2. Paid tiers: start at $19.99/mo for unlimited private-repo CLI usage. $99.99/mo for teams with GitHub PR checks, Slack alerts, and shared dashboards.
3. x402 tier: the paid API charges $0.025 to $0.125 per scoring call in USDC on Base and Solana.
Subscriptions for all tiers are available via Stripe.
You don’t have to adopt it. But you should at minimum run 10 of your production prompts through the free tier this week and look at the numbers.
If you’re shipping anything that takes prompts from humans or other agents, the input layer is measurable. Start measuring. You’re leaving most of the model’s capability on the table, and you don’t have to.
Try it yourself
If you want to score your own prompts, PQS is live at pqs.onchainintel.net — free tier available, paid tiers for full 8-dimension scoring and batch runs.
- MCP server on npm: npm install pqs-mcp-server. Drop-in for Claude and other MCP-compatible agents.
- GitHub Action: PQS Check on the Marketplace. Score prompts in CI before they ship.
- API: direct x402 micropayments on Base, or a Bearer API key (subscription tiers).
Data from this teardown: 500 software prompts, average score before optimization 13.27/80, average after 68.47/80, a roughly 416% improvement. 416 graded F, 84 graded D, 0 at C+ or above.
If you're in DevRel, developer advocacy, or DevEx working on AI pipelines, this is the input-quality data your builders need to see. Feel free to forward.
What I'd love feedback on: which of the 8 dimensions surprised you most as the common failure mode? My hypothesis was clarity, but the data says examples at 1.01 average and constraints at 1.09 — the structural stuff almost nobody includes.