29 March 2026
I’ve been experimenting with Claude Code on my side project hosh, an experimental shell written in Java. After a few days of pairing with an AI agent on real tasks — adding features, fixing bugs, resolving SonarQube warnings — here’s what I found.
The idea was simple: take a mature codebase (~1,500 commits, JDK11, Maven,
junit, archunit, >90% test coverage) and see how productive I could be with an
AI coding agent. Not on a greenfield toy project, but on something with real
constraints — a module system, a SPI and some opinions about code style.
I started by throwing tasks at Claude Code the way you’d throw them at a new team member:
The results were mixed. Some changes were good. Others were technically correct but stylistically wrong — the kind of code that passes CI but fails code review.
The turning point was recognizing that an AI agent without context is like an engineer without onboarding. It will produce something, but not necessarily what you want. And every wrong turn costs tokens.
So I wrote a CLAUDE.md file. This is a project-level document that Claude Code
reads at the start of each session. Mine covers the build commands, the
architecture (SPI, runtime, modules), the testing philosophy (TDD, no mocks of
types you don’t own), the code style (boring code over clever code, explicit
over implicit), and the things that are off-limits.
The effect was immediate. With guardrails in place, Claude stopped reinventing decisions that were already made. It respected the module boundaries. It wrote tests first. It stopped suggesting Gradle or Kotlin.
This is the part that surprised me: the CLAUDE.md is not just documentation
for the agent. It forced me to articulate things I had only kept in my head —
project conventions, architectural invariants, the reasoning behind certain
trade-offs. Writing guardrails for an AI turned out to be a useful exercise in
making implicit knowledge explicit. Writing is thinking made visible.
What makes Claude Code (or any other coding agent) genuinely useful is the
feedback loop speed. The cycle is: describe a task, review the proposed changes,
accept or reject, refine. Each iteration takes seconds, not hours. You can
experiment with approaches — “try it with an AtomicReference”, “now try it
with an immutable snapshot” — and compare the results almost instantly. Design
options are cheaper to explore.
But the loop only works well if you stay in it. The moment I tried to delegate a larger task and walk away, the quality dropped. Not because the agent isn’t capable, but because the accumulated micro-decisions in a mature codebase require judgment that no amount of context can fully replace.
The best sessions were the ones where I acted as a director: setting the goal, reviewing each step, steering corrections early. The worst were the ones where I tried to use Claude Code as an autonomous worker.
This maps surprisingly well to what Anthropic themselves have found. Their research on agent autonomy shows that experienced users shift from approving each individual action to a monitoring-and-intervening approach — but they also interrupt more often, not less. Trust grows, but so does the skill of knowing when to intervene (see Anthropic’s research on agent autonomy).
There’s a deeper question here: does the “human as director” model resonate with how AI tools should work? I think it does. The most productive pattern I found was collaboration — not delegation. The agent handles the mechanical work (boilerplate, test scaffolding, refactoring), while the human handles the architectural judgment, the “should we even do this?” questions, and the quality bar.
This isn’t a limitation. It’s the point. A fully autonomous agent that produces code I have to review anyway doesn’t save me time — it shifts the work from writing to reading, which is arguably harder. An agent that works with me, proposing changes I can steer in real time, actually makes me faster.
A few concrete takeaways:
Invest in CLAUDE.md early. It pays for itself in the first session. Think
of it as onboarding documentation — for an engineer who forgets everything
between sessions.
Keep the loop tight. Small, focused tasks with immediate review produce better results than large, vague ones. This is true for human engineers too, but with AI agents the effect is amplified.
The agent is a multiplier, not a replacement. It multiplies whatever direction you give it. Good direction, good output. No direction, plausible but wrong output.
Token cost is a design constraint. Every unnecessary file read, every wrong turn that needs correction, every overly broad context window costs real money. Guardrails aren’t just about quality — they’re about efficiency.
The hype around AI coding tends to oscillate between “it will replace developers” and “it’s just autocomplete”. The reality, at least for now, is more interesting than either: it’s a new kind of collaboration that rewards the same skills good engineering always has — clarity of thought, explicit constraints, and knowing when to trust your tools and when to override them.
JDK11 → JDK25 - upgraded all other dependencies to 2026The hosh source code, including the
CLAUDE.md, is available on GitHub.
Rahul Garg’s Encoding Team Standards
makes the case for treating these instructions as team infrastructure — versioned, reviewed, and shared.
The CLAUDE.md approach fits exactly into that pattern.