The line I drew on AI coding, and how I've already smudged it

When I started using AI agentically, I drew a line. The line was about planning. Before any code got written, I’d sit in a long chat with the model, scoped to the feature I wanted to build, until we had a shared understanding of what we were doing and why. Then a PRD. Then issues. Then code. The point of the line was that I’d still be the one making the decisions that mattered, and the model would be the one doing the typing.

That’s the version I’d give if you asked. The honest version is that the line has already smudged in three different places, and one of them bothers me more than the other two.

The clearest example of the line working the way it’s supposed to is a parts-data migration I did recently for a client. They had two sources of truth. About five hundred parts on the existing WordPress site, mostly clean, but with fields that needed splitting for a new schema. About a thousand parts in a spreadsheet, badly maintained, no clean ownership, some of them overlapping with the site and some not. The CSVs were too big to eyeball. I went into the planning chat assuming the AI would just read the data, clean it, and hand me back a finished file.

The chat talked me out of that. Not because the AI couldn’t have tried, but because the result wouldn’t have been reproducible, wouldn’t have left a paper trail, and wouldn’t have been re-runnable when the spreadsheet inevitably changed again. What we landed on instead was that the AI would write a Python script, under my direction, that merged the data deterministically. The AI’s job wasn’t to clean the data. The AI’s job was to look at the script’s input and output, find one systemic flaw at a time, fix it, and log the decision.

AI doesn’t do the work. AI writes the script that does the work. That distinction came out of forty minutes of asking what reproducibility and auditability looked like for this particular client.

What fell out of that chat wasn’t a feature spec. It was a way of working. An iteration loop with rules attached: read the input, sample the outputs, pick one systemic flaw, fix it, append a structured entry to a decisions log, commit, push, loop. None of that was in my head when I opened the chat. The planning surfaced the shape of the solution, not just the implementation.

The planning was the line at its best. What happened once the code started running wasn’t.

I barely read the commits as that migration loop ran. The justification I told myself was that there’d be a full human review of the merged parts data at the end, so each script iteration didn’t need eyes on it. In hindsight that was probably the wrong call. A downstream review catches per-row data problems, not “the script started doing the wrong thing in iteration seven.” And the model was working in a domain it didn’t fully understand, machine parts, which is exactly the situation where I should have been reading along instead of trusting the loop.

The second smudge was on a different project: a personal tool for finding roles via LinkedIn. The code was tested. We did TDD. What I didn’t do was a thorough manual QA pass with real data. The seed and mock data I’d given it wasn’t representative enough of the actual shape of LinkedIn data, so things passed in development that broke the moment they met production. The smudge there isn’t “no tests.” The smudge is letting a green test suite stand in for using the thing the way a user would.

And then the one that makes me feel dirty. Across most of my recent projects, I’ve found myself skimming PRs. Not reading. Scrolling. The specific instance I can still picture is the LinkedIn tool again. A batch of small UI changes the model had made. I opened the PR, scrolled to the bottom, approved. The previous few runs had been fine, and I’d absorbed that as “this one will also be fine.” It wasn’t a decision. It was a momentum. I only noticed afterwards, which is the part I keep coming back to.

The first two were calls I made and could argue about with myself. The third one wasn’t a call. It was a habit, and I caught it on the way past.

I have old blog posts about how to do pull requests well. About what a thoughtful review looks like. And here I am scrolling through a diff at a speed that wouldn’t have passed my own bar five years ago. That’s the dirty feeling, right there. Not the AI. Me, holding myself to a lower standard for AI-authored code than I’d hold a human teammate to. Including, embarrassingly, past-me.

My flow keeps changing. Any rule I’d write down right now about “plan before, review after” would be true for this week and out of date in a month. So the line itself is a moving target. The line moves. The standard underneath it doesn’t.

The standard is: code I understand, and could maintain without the model. The line is whatever practice, this month, makes that standard true. This month the practice is a long planning chat before code, a PRD, slice issues, and, after this post, actually reading the diffs again instead of scrolling. Next month it will probably be something else, because the tools are moving and so is my sense of where they’re trustworthy.

There are two signals I’m watching for, that would tell me the line has stopped making the standard true. The first is opening code I supposedly own and finding it a mess I can’t read or reason about. Every dev forgets what their own code did six months later. That’s not a signal of anything except being human. The signal is the version of that where I clearly never understood the code in the first place. Forgetting is normal; never having known is the failure.

The second is the skimming itself. Catching myself scrolling rather than reading, in the moment, not in a quarterly review. Which means it’s not a state I get to check on once in a while. It’s a behavior I have to keep watching for, on every PR, for as long as the model is the one writing them.

The line will keep moving. I’d rather have a standard that holds than a rule that doesn’t.