If you are using Claude Code in production, the agent is not the problem

Originally published at medium.com · 11 min read

The agent is not the problem. The substrate around the agent is. Here is the four-file shape that made my Convex production-database write boring on 2026-05-06, plus a stdlib Python script the Claude Code agent calls but did not generate.

Before the question lands: the bridge is not an agent improvising near prod. It is a deterministic Python script the agent calls. It writes to a bounded Convex HTTP endpoint. Prod writes need an env flag and a 401-preflight to fire. The cache dedups on a composite idempotency key. The logger strips anything that smells like a secret before it hits disk. The agent did not invent any of that.

In late April, a viral Reddit thread about an agent deleting a production database was making the rounds (recap). The fear was real. My bridge is still running because the agent did not have the room to improvise. Below is what made that true. Every claim has a file path you can grep for in your own setup.

What context engineering actually means in Claude Code

Vendors are calling context engineering the skill that separates serious solo operators from people who dabble. The trend is real. NxCode says over 60,000 public GitHub repos now include some form of agent instruction file. What it means in Claude Code is concrete: a CLAUDE.md at every project root, a global one at ~/.claude/CLAUDE.md, an auto-memory layer at ~/.claude/projects/, plus the skills directory, plus the templates directory, plus the hooks. Each one is a slot where context lives across sessions instead of being retyped into the chat.

Context engineering is not prompt engineering 2.0. Prompt engineering happens in the user message. Context engineering happens in the files Claude Code reads before the user message. Identity. Business state. Registries. Decisions. Voice. Scripts. Memory.

The common framing is one big system prompt and hope for the best. That breaks the moment your work spans more than one task surface. The layered framing replaces it. Folders, files, hooks, memory, vault. Each layer answers a question Claude would otherwise have to guess at.

There is a name for the pattern people are converging on: the control stack. Project rules plus reusable skills plus bounded sub-agents plus deterministic tools. The label is just the label, but the substance is what matters. Project rules go in CLAUDE.md. Reusable skills go in ~/.claude/skills/. Bounded sub-agents are described per-task in their own SKILL or template files. Deterministic tools are scripts the agent calls (not generates) when the task needs a specific outcome.

A vibe-coded one-shot script and a rail with all of this behind it can both run on a Tuesday. The difference is what is around the script. Tests, logs, idempotency cache, config gates, audit trail.

Both ship. Only one survives Wednesday.

Only one helps when you come back six weeks later and need to remember why a flag is set the way it is. The vibe-coded one is forgotten code. The control-stack one composes.

Concretely, here is what feeds my Claude Code session every conversation, before I type anything:

  • ~/projects/agent-os/CLAUDE.md is the load-bearing identity file. Who I am, what I sell, who I sell to, the 90-day priorities. The agent does not ask. It reads.
  • ~/.claude/projects/-home-jon/memory/MEMORY.md is the auto-memory index. User profile, feedback rules, project state, references that survive across sessions. The agent does not relearn me every conversation.
  • references/framework.md is the operator playbook. How decisions get made, what to optimize for, and what holds the rest together when the work scales.
  • decisions/log.md is the append-only why-log. Not just what changed. Why.

Those four files are part of the substrate. The identity and memory layers are loaded by default. The others are where the agent goes when the task calls for them.

The layers, one folder at a time

The Agent OS lives in three roots. ~/projects/agent-os/ for the project layer. ~/.claude/ for the global layer. ~/vault/ for the second-brain layer. Each layer has one job, and Claude Code reads them in a known order.

connections.md is the registry of every external system Jon-OS can reach. Gmail wired via the gws CLI. Drive the same way. GitHub via gh. Gumroad via PAT. NotebookLM via the notebooklm CLI. Status and last-touched date on each. When I started the Skool bridge, the agent did not have to guess whether Convex was reachable. The registry said yes, here is the deployment slug, here is the auth model. One row per system. The bridge never blocked on "is X wired."
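For shape, one row per system might look like the table below. The column names and dates are illustrative guesses; the post only commits to one row per system with a status and a last-touched date.

```markdown
| system | via        | status | last touched |
|--------|------------|--------|--------------|
| Gmail  | gws CLI    | live   | (date)       |
| GitHub | gh         | live   | (date)       |
| Convex | HTTP + slug| live   | (date)       |
```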

decisions/log.md is the append-only record of every non-trivial decision and why. Reversible decisions get a one-line entry. Load-bearing ones get the full receipts. The decision to gate prod writes on SKOOL_ALLOW_PROD_WRITES=1 plus a 401-preflight against an allowlisted Convex deployment lives there. So does the decision to keep member content deny-by-default in logs. Future me reads it. Future agent reads it. When a teammate asks "why is this flag here," the answer is already written.
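An entry for the prod-write gate might look like this. The entry format is my sketch; the post only commits to append-only, one entry per non-trivial decision, with the why.

```markdown
## 2026-05-06 — Gate prod writes
Decision: require SKOOL_ALLOW_PROD_WRITES=1 plus a 401-preflight against the
allowlisted Convex deployment slug before any POST.
Why: a misconfigured run must fail closed at startup, before the first row is read.
Class: load-bearing (full receipts, not a one-liner).
```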

references/ holds voice samples, API guides, framework writeups. references/skool-api.md had the Skool Sheets endpoint shape, the field names, and the failure modes spelled out before any code got written. The composite idempotency key was specified there as {tab_slug}:{normalized_transaction_id}, before any handler. The bridge inherits the discipline of the spec. The spec inherits the discipline of the research that produced it.

scripts/ holds gumroad_api.py, skool_sheets_to_convex.py, and a handful of others. Stdlib-only Python. Deterministic tools the agent calls, not generated on demand. The Skool bridge is skool_sheets_to_convex.py. The redacting logger is one function in that script that strips email-shaped substrings and known secret prefixes before any line lands in the journal. The agent did not write that on a hunch. It came from the spec.
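The redaction step can be sketched in stdlib Python. The function name, the exact regexes, and the secret prefixes here are assumptions; the real function lives in skool_sheets_to_convex.py and the post only specifies "email-shaped substrings and known secret prefixes."

```python
import re

# Assumed patterns: email-shaped substrings plus a few common secret
# prefixes (sk-, ghp_, gho_). The real list lives in the spec.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SECRET_RE = re.compile(r"\b(?:sk-|ghp_|gho_)[A-Za-z0-9_-]+")

def redact(line: str) -> str:
    """Strip anything secret-shaped before the line hits the journal."""
    line = EMAIL_RE.sub("[email-redacted]", line)
    line = SECRET_RE.sub("[secret-redacted]", line)
    return line
```

Deny-by-default means the caller only ever logs IDs and outcomes; the redactor is the backstop, not the policy.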

~/.claude/skills/ and ~/projects/agent-os/.claude/templates/ hold reusable skills and prompt templates. The Bike Method (training wheels first, then take them off) is encoded here. Prompt template first, then SKILL.md, then orchestrator. New capability gets the lowest-autonomy version that works. It only graduates when I make an explicit edit. The content pipeline I am running this post through started life as a prompt template. The next phase turns it into a skill. Slow on purpose.

~/.claude/projects/-home-jon/memory/ is Claude Code's auto-memory layer. It survives across conversations. The workflow rule that produced the Skool bridge lives here as a feedback memory. Research, then planning, then spec, then implementation, with Codex adversarial review at each phase. The agent does not relearn that every session. It just does it.

~/vault/ is the Obsidian second-brain. PARA-routed. Sessions, drafts, insights, workflows, references. The session note for the Skool bridge ship day lives at sessions/2026-05-06-skool-rail-live-content-pipeline-aegis-staged.md. Markdown plus frontmatter plus wikilinks. Plain text, plain folders, no platform lock-in.

Each layer has one job. Adding a new system means adding one row to connections.md and maybe one script to scripts/. The layers stay stable, so the work compounds. That is the boring promise of the control stack.

A worked example: the Skool transactional rail

The Skool transactional rail is a Python bridge. It polls two Google Sheet tabs that Skool-driven Zaps populate. New paid-member rows POST to a Convex HTTP endpoint that owns my YCAH member table. The other tab parses through without posting. The bridge runs on a systemd user timer:

OnCalendar=*:0/10

Every 10 minutes on the boundary. It has been live in prod since 2026-05-06.
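For reference, a user timer with that schedule looks roughly like this. The unit names (skool-bridge.timer pairing with skool-bridge.service) are assumptions; the post only gives the OnCalendar line.

```ini
# ~/.config/systemd/user/skool-bridge.timer (name assumed)
[Unit]
Description=Poll Skool sheets and sync to Convex

[Timer]
OnCalendar=*:0/10
Persistent=true

[Install]
WantedBy=timers.target
```

Enabled with `systemctl --user enable --now skool-bridge.timer`.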

Here is how it got safe.

The spec lived in references/skool-api.md before the code existed. Endpoint shape, field names, normalization rules, failure modes, idempotency strategy. Researched first, written second. The composite idempotency key was decided there:

{tab_slug}:{normalized_transaction_id}

Before any handler.
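As a sketch, the key and the dedup check are a few lines. The normalization rule here (strip plus lowercase) is an assumption; the real rules live in references/skool-api.md.

```python
def idempotency_key(tab_slug: str, transaction_id: str) -> str:
    """Composite key from the spec: {tab_slug}:{normalized_transaction_id}.
    Normalization (strip + lowercase) is assumed, not from the post."""
    return f"{tab_slug}:{transaction_id.strip().lower()}"

def already_processed(cache: dict, tab_slug: str, transaction_id: str) -> bool:
    """Check the processed-events cache before any POST fires."""
    return idempotency_key(tab_slug, transaction_id) in cache
```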

The spec went through Codex twice for adversarial review. The first pass killed a cookie-auth approach that would have violated Skool's ToS. The second pass drove composite idempotency keys and the prod-write guard. Both passes still missed an inferred field assumption. I had assumed Skool's "Answered Membership Questions" trigger exposed the member's email. The dry-run caught that the field was not there. Not Codex. Not me re-reading my own spec. The dry-run, against real data. The honest version: review found a lot, but the empirical step still earned its keep.

The pivot was clean because the spec was a document, not a vibe. I rewrote the affected section to "parse the QA tab, but do not POST member rows from it (the email-bearing trigger is the New Paid Member event on the Paid Members tab)," shipped Option A, kept moving. If the spec had only lived in my head, I would have been mid-implementation when reality disagreed. Instead, I edited a markdown file.

The cache had a quieter bug. The initial _read_json swallowed JSONDecodeError and returned a default empty dict. Under the corruption test in the verification checklist (step 0e: deliberately corrupt the cache file, run the bridge, check what happens), it would have silently rebuilt the processed-events cache and double-POSTed every prod row that had already been posted. Caught and fixed before the canary ran. To be precise: the test caught it. Not the spec, not the agent, not me eyeballing the diff. The verification checklist did its job.
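The fix is to fail closed on corruption. A minimal sketch of the corrected read, with the function name and exit message assumed:

```python
import json
import sys
from pathlib import Path

def read_cache(path: Path) -> dict:
    """Load the processed-events cache, failing closed on corruption.
    The buggy first draft returned {} on JSONDecodeError, which would
    silently rebuild the cache and double-POST already-posted rows.
    A missing file is the only state allowed to start empty."""
    if not path.exists():
        return {}
    try:
        return json.loads(path.read_text())
    except json.JSONDecodeError:
        sys.exit(f"cache corrupt at {path}; refusing to run")
```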

The other guardrails earned their keep too:

  • Prod-write guard: SKOOL_ALLOW_PROD_WRITES=1 plus a 401-preflight against the allowlisted Convex deployment slug means a misconfigured run cannot touch prod. Fails closed at startup, before the first row is read.
  • Redacting logger: strips email-shaped substrings and known secret prefixes from every line before it hits the journal. Member content is deny-by-default. The bridge never logs row contents, only IDs and outcomes.
  • Canary verified: a jgerton+ canary address landed clean in prod Convex. The idempotent re-run produced 0 new POSTs, 1 skipped, 0 errored.
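The prod-write guard can be sketched like this. The function name, the slug-parsing, and the preflight path are assumptions; the post only commits to the env flag, a 401-preflight, and an allowlisted deployment slug, all checked before the first row is read.

```python
import os
import urllib.error
import urllib.request

def preflight_or_die(convex_url: str, allowlist: set) -> None:
    """Fail closed at startup: env flag, slug allowlist, then a 401-preflight.
    An unauthenticated POST should bounce with 401; anything else is wrong."""
    if os.environ.get("SKOOL_ALLOW_PROD_WRITES") != "1":
        raise SystemExit("prod writes disabled: SKOOL_ALLOW_PROD_WRITES != 1")
    slug = convex_url.split("//", 1)[-1].split(".", 1)[0]
    if slug not in allowlist:
        raise SystemExit(f"deployment {slug!r} not in allowlist")
    req = urllib.request.Request(convex_url, data=b"{}", method="POST")
    try:
        urllib.request.urlopen(req, timeout=10)
        raise SystemExit("preflight got 2xx without auth; endpoint misconfigured")
    except urllib.error.HTTPError as e:
        if e.code != 401:
            raise SystemExit(f"preflight expected 401, got {e.code}")
```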

None of those guardrails came from the agent improvising. They came from the spec. The spec came from the research phase. The research phase came from the workflow rule in memory. The substrate persists the rules. Discipline runs them. The dry-run and the corruption test catch what reviews miss. The agent itself wrote unsafe cache behavior on the first pass. The checklist caught it.


Sidebar: my Claude Code workflow rule

The discipline above is anchored in one feedback memory at ~/.claude/projects/-home-jon/memory/feedback_workflow_research_to_implementation.md:

Research, then planning, then spec, then implementation, with Codex adversarial review at each phase.

Codex (the OpenAI Codex CLI) is the adversarial reviewer that runs alongside Claude Code in my setup. It runs against the spec, then again against the implementation, before anything ships. The reviews catch the things I am too close to see. The dry-run catches the things Codex assumes. The verification checklist catches the things the dry-run does not exercise. The agent itself catches almost nothing past the first draft, and that is fine, because the agent is not the safety net. The workflow rule is. The rule lives in Claude Code's memory layer because the rule is the part that has to survive across every session. The agent does not relearn it. It just does it.


What compounds

The audit score ticking from 77 to 85 is not the point. What got cheaper is. The /audit pass surfaced a missing freshness column in connections.md, an old voice sample I had outgrown, and a reference doc whose code paths had moved. Each fix made the next session faster. The agent stopped re-explaining state I had already encoded.

Bigger payoff: the Aegis migration. I spent a day bulk-staging 75 skills, 66 commands, 5 agents, and 8 hooks from my desktop to the always-on minipc that runs Jon-OS 24 hours a day. None of those got built. They got recon'd, staged, and gated behind a settings diff, then activated. The hard work is path rewrites and activation order, not authoring.

The one I lit up first: the notebooklm skill. NotebookLM is now reachable as a CLI on this minipc, with ~/vault/CLAUDE.md mapping each domain to its dedicated notebook. I wrote a thin wrapper at ~/.claude/workflows/nlm-queue.py so the session-extract skill can auto-source captured insights to the right notebook (queues on auth failure, drains after re-auth). That makes NotebookLM scriptable from inside the content pipeline. The post you are reading now leaned on manual research. The next version of this pipeline can lean on a NotebookLM query inside the prompt template. Same discipline, less typing. Every shipped rail makes the next one cheaper.
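The queue-then-drain pattern in that wrapper can be sketched in a few lines. The queue path and function names here are assumptions; the post only says the wrapper queues on auth failure and drains after re-auth.

```python
import json
from pathlib import Path

# Assumed queue location; the real wrapper lives at ~/.claude/workflows/nlm-queue.py.
QUEUE = Path.home() / ".claude" / "workflows" / "nlm-queue.jsonl"

def enqueue(item: dict) -> None:
    """Append one captured insight when the notebooklm CLI auth fails."""
    QUEUE.parent.mkdir(parents=True, exist_ok=True)
    with QUEUE.open("a") as f:
        f.write(json.dumps(item) + "\n")

def drain(send) -> int:
    """After re-auth, replay queued items through `send` and clear the queue."""
    if not QUEUE.exists():
        return 0
    items = [json.loads(line) for line in QUEUE.read_text().splitlines() if line]
    for item in items:
        send(item)
    QUEUE.unlink()
    return len(items)
```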

The slower payoff is discovery. Three new YCAH members joined from three distinct platforms in the last month. Dev.to, Reddit /r/ClaudeAI, and TikTok. Three different surfaces, reached without me hand-writing each variant. Pipeline output, then discovery, then membership. The compounding works on the audience side too.

What this is not

This is not vibe coding. I do not trust one-shot scripts near prod. Not because someone on Twitter said so. Because I have been the person who shipped a one-shot, watched it work, and then watched it quietly stop working two weeks later for reasons I had no audit trail to understand. Front-loading the work is what makes the guardrails routine instead of heroic.

It is also not stacking Notion plus Claude plus ChatGPT plus Cursor and calling it a system. Those are fine tools. The mental model treats each one as an island. The control stack treats them as one composable system with a known read order. The intended read path is identity, state, registries, decisions, voice, scripts, memory, then the conversation. The point is not magic order. The point is that the context lives in stable files. Add a tool, you add a row in connections.md and maybe a wrapper in scripts/. You do not add a context-switch tax for yourself or the agent.

It is not theory. Every implementation claim above has a file path on my machine. Set up the same shape and you can grep your own.

What to copy first

If you are going to copy one piece, copy connections.md. Knowing what your Agent OS can reach is the cheapest unlock. You will build everything else against it.

Second: a decisions/log.md you actually keep. Append-only. One entry per non-trivial decision. The audit trail compounds faster than you think. By month two, you stop relitigating choices you already made and forgot you had made.

Third: seed Claude Code's auto-memory layer with your voice, your workflow, and your role. Once it is there, every conversation gets cheaper. You stop reintroducing yourself. The agent stops asking you the same orientation questions every Tuesday.

That is the order. Connections first, decisions second, memory third. Three files, in three days, before any new tool gets wired up. The bridge that runs my member sync was not built because I had a clever agent. It was built because the substrate was already there, and I used it. The discipline is mine. The files just stayed where the next session could find them. Boring on purpose.

Open one of those three files in your editor tomorrow morning. Write the first row. The compounding starts there.

Inside YCAH, I am doing this same work in public. The 4-file substrate, the workflow rule, the spec-first discipline, applied weekly to whatever members are shipping. Practitioner depth, not vending-machine demos.
