Roger Fleig

Stop Calling Coding Agents Eager Interns

Roger Fleig — Tue, 28 Apr 2026 00:03:46 GMT

The popular framing of coding agents as eager interns gets them almost exactly backward. The agents I worked with were often the opposite of eager — cautious to the point of inaction unless they had a clear license to proceed. Under the right constraints, what they displayed looked much more like engineering judgment than eagerness: caution around broad changes, sensitivity to sequencing, a willingness to stop when the scope ran out. The intern framing tells you to teach. What I needed to do was define the lane, then stop standing in it.

I only figured this out because I spent a while being annoyed by it first.

I built a small agent simulator on the Roblox platform earlier this month — 107 Luau files by the end — and in the rush of getting something working I skipped the code health infrastructure: no formatter, no linter, no typechecker. A mistake I knew I was making and made anyway. Late in the project, with the core system working, I went back to fix it.

I asked my coding agent to set up formatting, style linting, and static type checking, then enforce them across the existing codebase. The agent installed the tooling, wired it into the build, and reported done.

Almost nothing had changed.

I pushed harder. It did a little more — a handful of files, --!strict added at the top, type errors resolved. Then it stopped. I pushed again. Same result. Iteration after iteration I kept escalating the request and kept getting back careful incrementals, one file at a time, every single one clean and correct and frustratingly small.

Here’s what one of those commits looked like:

--- a/src/mechanics/CrestDipBuilder.luau
+++ b/src/mechanics/CrestDipBuilder.luau
@@ -1,3 +1,5 @@
+--!strict
+
 -- CrestDipBuilder.luau
...
-local CrestDipBuilder = {}
+type CrestDipBuilderModule = {
+    build: (state: SectorState, entryFrame: CFrame, exitFrame: CFrame) -> (),
+}
...
-function CrestDipBuilder.build(state: SectorState, entryFrame: CFrame, _exitFrame: CFrame)
+local function build(state: SectorState, entryFrame: CFrame, _exitFrame: CFrame): ()
     local p = state.params
-    assert(p.height_or_depth, "CrestDipBuilder: missing height_or_depth")
+    local height = assert(p.height_or_depth, "CrestDipBuilder: missing height_or_depth")

Precise. Complete. One file. Then stop.

I had assumed the agent was simply being timid. While writing this post I went looking for why, and found something I hadn’t known was there. The project’s AGENTS.md — a contract file that coding agents read at the start of each session — contained this line: “Do not perform bulk --!strict upgrades as part of routine hygiene; ratchet strictness module-by-module with explicit scope.” I hadn’t written it directly. One of the earlier coding agents had added it during a handoff compaction pass, promoting a lesson from a previous session into a standing rule. I hadn’t noticed it was there. The agent demurring on my requests wasn’t defiance or timidity — it was following guidance it had written for itself, because it had already learned that bulk strict upgrades on a working codebase are the kind of thing that breaks things quietly.

Worth noting: AGENTS.md is soft guidance, not enforcement. Agents can and do trade it off against local objectives when pushed hard enough — I’ve written about that elsewhere. But in this case it held, which made the conservatism invisible to me until I went looking for it.

My reaction at the time was still frustration. I had approved this work a dozen times over in spirit. Why was the agent making me re-approve each file?

Eventually I stopped pushing and tried something different — a mode I came to think of as a first-class collaboration tool: “don’t design or implement, just chat with me.” The explanation that came back was the clearest writing I got from any agent all week.

It laid out the three distinct goals hiding inside “turn on type safety everywhere.” First: make the analyzer understand the Roblox platform at all — without the right engine type definitions, the typechecker would flag Vector3 and CFrame (built-in Roblox 3D types, not anything in my code) as unknown identifiers, producing noise rather than signal. Second: make every file pass the checker cleanly — no errors, but still in the default permissive mode where many type problems are simply not checked. Third: make every file declare itself --!strict, opting into the full type discipline. The first goal had only just finished. Doing the second and third before the first was done would have produced “strict” files whose strictness was illusory — every file flagged clean, with real type errors hidden behind a wall of tool noise.

Then the metaphor that stuck:

The repo is a workshop. Phase 31 put the lights on. Now you can actually see the clutter. The clutter was already there. Turning on the lights did not create it.

What I had been reading as refusal was sequencing. Mass-flipping files before the analyzer was ready would have produced meaningless results — and the agent, apparently, had already worked that out.

The “eager intern” framing makes sense as a description of first contact with these tools. They generate text confidently, they need correction, they can overstep. That pattern feels intern-like, and the management model it suggests — corrective feedback, expectation-setting, teaching — feels reasonable.

And to be clear: overreach is real. Agents can make sweeping unreviewed changes, operate outside the operator’s intent, or cause serious damage when the hard constraints aren’t there — no guardrails, no sandboxing, no limits on what the agent can actually touch. That’s a real failure mode and it deserves serious attention.

What I’m describing is different — and I think it only becomes visible once the hard constraints are actually in place. With the harness set up properly, the tests running, the scope defined, the rollback available, what I kept running into wasn’t overreach. It was the opposite. The agent knew where the edge was and was treating it as a hard stop rather than a suggestion. An intern overshoots because they don’t yet know where the line is. What I saw looked more like a senior engineer who understood the line clearly and wasn’t going to cross it without being explicitly told to.

That reframe changes the workflow problem. If the agent has the judgment but not the green light, the answer isn’t more correction or more prompt engineering. It’s a clearer scope — a phase plan, an explicit policy, a state file that says what’s in bounds this run and what requires a human decision. Not to enable recklessness, but to give a capable and cautious collaborator the structure it needs to actually do the work.

I eventually ran the full migration as a campaign: a small orchestrator that invoked the agent file-by-file against a state file with explicit budgets and escalation rules. The campaign carried the project from 27 strict-typed files to 90 across five unattended runs: one escalation, sixty-three clean commits, and no broad rewrite. The agent didn’t need me in the room for most of it. It needed me to have been clear about the rules before it started.
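
A campaign loop of this kind can be sketched in a few lines. The version below is a minimal illustration in Python, not the orchestrator used for this project: the CampaignState fields, the budget values, and the migrate_file callback (a stand-in for invoking the agent on one file) are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class CampaignState:
    """Hypothetical state-file contents: what is in bounds for this run."""
    pending: list[str]                      # files still waiting to be migrated
    done: list[str] = field(default_factory=list)
    escalated: list[str] = field(default_factory=list)
    max_files_per_run: int = 15             # budget (assumed value)
    max_escalations_per_run: int = 1        # escalation rule (assumed value)


def run_campaign(state: CampaignState,
                 migrate_file: Callable[[str], bool]) -> CampaignState:
    """Work through pending files one at a time until a limit is reached.

    migrate_file returns True for a clean commit and False when the file
    needs a human decision.
    """
    attempted = 0
    escalations = 0
    while state.pending and attempted < state.max_files_per_run:
        if escalations >= state.max_escalations_per_run:
            break  # hand control back to a human instead of pushing on
        path = state.pending.pop(0)
        attempted += 1
        if migrate_file(path):
            state.done.append(path)
        else:
            state.escalated.append(path)
            escalations += 1
    return state
```

The stopping rules live in data that a human writes before the run, so the agent never has to guess whether it has license to continue.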

That’s not a story about giving an agent more power. It’s a story about giving a careful agent a well-defined lane.

This comes from building Autotrack, a Roblox agent simulator. The longer version — including how the campaign mode orchestrator works in detail — is in this essay. I write about agentic infrastructure on Substack.

The Case for Architectural Linting

Roger Fleig — Tue, 14 Apr 2026 22:49:23 GMT

Every senior engineer knows this dynamic, even if they’ve never named it.

A new developer joins the team. Their first few PRs are fine — the code compiles, the tests pass — but something is off. They reach for raw SQL instead of the query wrapper. They have the API handler call the data layer directly, bypassing the service layer. They add a new HTTP client instead of using the instrumented one the team spent three months building. None of this breaks anything. All of it violates how the system is supposed to fit together.

So a senior engineer leaves a review comment. And another. And another. The new developer adjusts. Over weeks and months, they absorb the unwritten rules — not because anyone handed them a document, but because someone took the time to correct them, and they internalized the corrections. Eventually they stop needing the comments. They start leaving the comments themselves.

This is the apprenticeship model of architectural knowledge, and it’s the hidden load-bearing structure of code review. We talk about review as a quality gate — catching bugs, verifying correctness. But its second function is at least as important: transmitting the implicit rules of how a codebase fits together, one review comment at a time, until the new contributor stops being new.

The economics of this system make sense when the contributors learn. The cost is front-loaded; the payoff compounds. A senior engineer invests heavily in a junior engineer’s first fifty PRs and then recoups that investment over the next five hundred. The apprenticeship model is expensive, but it’s an investment with returns.

Agents break this model at the root.

Code review used to do two jobs: validate changes and transmit architectural judgment. Agents do not remove the need for the first, but they break the second. If that judgment still matters — and it does — it has to move out of reviewer memory and into enforceable checks.

This is the sixth in a series on The Narrow Pipe. The last post was about review as a bottleneck — machine-speed generation meeting human-speed validation. This one is about review’s second job: transmitting architectural intent. And what you build when agents make that transmission mechanism obsolete.

Why Agents Break the Economics

A coding agent doesn’t accumulate architectural judgment across sessions. It doesn’t feel the friction of a review comment and adjust its priors for next time. In the way that matters here, every agent PR is a first PR: the feedback may fix this change, but it rarely compounds into lasting architectural judgment. You can reject the change, explain the violation, and the agent will fix it. But the next session, on a different task, it will reach for the raw SQL again.

You are not front-loading cost for a future payoff. You are paying the full cost of architectural review on every change, forever, with no compounding return on that investment. The economics that justified using review as an enforcement mechanism simply do not apply.

This matters more than it sounds like it should, because the apprenticeship model wasn’t just one way of transmitting architectural knowledge — in most codebases, it was the primary one. Written architectural rules are rare. The rules live in the heads of senior engineers and get transmitted through review. When the mechanism that transmits them stops working, those rules don’t get transmitted at all.

The only way the rules survive is if someone encodes them explicitly.

Why “Tests Pass” Stops Being Reassuring

CI as we know it answers a narrow question: does this code compile, and do the existing tests pass? That was a reasonable floor for quality when every change was written by someone who had absorbed months or years of architectural context. It is not a reasonable floor when changes are generated at machine speed by tools that have no model of how the system fits together.

Tests validate behavior you encoded. Architecture often lives in behavior you never encoded.

The gap between “tests pass” and “this code is correct” shows up clearly in the data. Veracode’s longitudinal analysis of AI-generated code finds that syntax correctness now exceeds 95% — code that compiles and runs — but security pass rates have plateaued at roughly 55%. The gap persists because the feedback loop rewards “works,” not “safe.” Tests can show the code works without showing that it is safe. The invariant isn’t encoded in the feedback loop, so the agent never learns to respect it. The same dynamic often applies to architectural constraints not captured in your test suite.

Without enforcement, agent-generated code doesn’t fail dramatically. It drifts. An agent violates an unwritten convention. The code passes CI. A reviewer approves it under volume pressure. The violation becomes part of the codebase. Future agents retrieve the violated pattern as context. The violation becomes the new normal. At human scale, that gap was filled by the apprenticeship model. At agent scale, it’s not filled by anything — unless you build something to fill it.

What Architectural Linting Actually Means

The response to this problem has a name, even if it’s not yet universal: architectural linting. Not style linting — formatting and naming conventions are already solved. Not static analysis — null checks and type errors are already solved. Not security scanning, though that’s closer. Architectural linting is a new category of deterministic checks that encode structural invariants — the rules about how a codebase fits together that currently live only in senior engineers’ heads.

Concrete examples make this tangible. Dependency direction rules: “the API layer never imports from the data layer.” Module boundary enforcement: “this service owns these tables and no other service queries them directly.” Pattern compliance: “all database access goes through this wrapper, not raw SQL” or “all HTTP clients use the telemetry-instrumented client, not the standard library.” Deprecated path detection: “this API looks available but has 200 services depending on its side effects — don’t add a 201st.” Change scope constraints: “modifications to shared utilities require explicit blast radius acknowledgment.” Some are architectural in the classic sense; others are engineering invariants experienced teams learn the hard way and should stop relying on memory to enforce.
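
The first of these, a dependency direction rule, is simple enough to sketch. The following is a rough illustration for a Python codebase using the standard library's ast module. The package names app.api and app.data and the FORBIDDEN_IMPORTS table are assumptions made up for the example, and relative imports are ignored for brevity.

```python
import ast

# Hypothetical rule table: modules inside the key package must not import
# from any of the listed packages.
FORBIDDEN_IMPORTS = {
    "app.api": ("app.data",),
}


def imported_modules(source: str) -> list[str]:
    """Return the absolute module names imported by the given source text."""
    names = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.append(node.module)
    return names


def _within(module: str, package: str) -> bool:
    """True if module is the package itself or lives underneath it."""
    return module == package or module.startswith(package + ".")


def check_dependency_direction(module_name: str, source: str) -> list[str]:
    """Return one message per import that crosses a forbidden boundary."""
    violations = []
    for layer, forbidden in FORBIDDEN_IMPORTS.items():
        if not _within(module_name, layer):
            continue
        for imported in imported_modules(source):
            for banned in forbidden:
                if _within(imported, banned):
                    violations.append(
                        f"{module_name} imports {imported}: "
                        f"{layer} must not depend on {banned}"
                    )
    return violations
```

A CI step could run a check like this over every changed file and fail the build when the returned list is non-empty.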

A word on what architectural linting is not. Repo-local guidance files (AGENTS.md, CLAUDE.md, Cursor rules files) are useful for prevention, but they are guidance, not enforcement. An agent can forget them, ignore them, or trade them off against a local objective. Architectural linting is different: a hard check that can reject a change for architectural reasons, regardless of whether the agent meant well.

These are not hypothetical. CyberAgent’s engineering blog describes encoding Clean Architecture layer dependency rules in YAML configuration and integrating the checks into CI — and they explicitly call it an “architectural linter.” Tools like ArchUnit for Java and Dependency Cruiser for JavaScript and TypeScript already exist for enforcing layering rules programmatically.

Once architectural rules are machine-readable, they don’t have to live only at CI. Some agent frameworks now support enforcement points inside the loop itself — hooks that intercept tool calls or block session completion until architectural checks pass.

There is a second design choice here: not just whether a rule is enforced, but how visible its enforcement is to the agent. If the agent can inspect the test, policy, or check logic, that implementation becomes part of the optimization surface. Sometimes that is useful. Other times, it pushes the agent to optimize for the check rather than the rule. For those constraints, the safer pattern is to enforce the invariant behind an interface the agent can use but not reverse-engineer. Spotify describes a similar design in its Honk post: verifiers are exposed as an interface the agent can call, while their internal logic stays hidden. The same pattern can matter for architectural enforcement.

Architectural linting also directly addresses the narrow pipe. The review bottleneck exists because human review can’t scale to agent-generated volume. Architectural linting doesn’t replace review — but it means review can focus on judgment calls rather than catching invariant violations that a machine could have caught. The reviewer’s job shifts from “does this code respect our architecture?” to “is this the right thing to build?” That’s a higher-leverage question, and it’s the question that actually requires human judgment.

How to Start Monday

Here is one I learned the hard way: never do calendar math by hand. At Microsoft, SQL Server 2005 would not restart after a certain date because an engineer evaluated a certificate with hand-rolled date logic. I remember staying up all night as the day began in Asia, then Europe, validating the hotfix and waiting to see how much damage would unfold. Years later I saw a similar class of mistake again at Google. Enough times, at enough companies, that it became a personal invariant: never do calendar math by hand. The code compiles. The types check. You can still miss it in review and CI. It is still the wrong code. And for years, I felt the frustration of not having a reliable way to enforce that rule across a codebase.

Every senior reviewer has a few rules like that — rules written in scar tissue rather than design docs. Those are exactly the rules that should stop living only in memory and start living in enforcement.
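
As one illustration of moving a scar-tissue rule into enforcement, here is a rough sketch of a check for hand-rolled calendar math in Python source. The heuristic is an assumption invented for this example (it flags a few well-known constants when they appear inside arithmetic), and it is not how any team in the story above enforced the rule. A real check would need tuning against false positives.

```python
import ast

# Assumed heuristic: these integer literals inside arithmetic usually mean
# someone is converting days, minutes, or seconds by hand.
SUSPICIOUS_CONSTANTS = {365, 366, 1440, 86400}


def find_hand_rolled_calendar_math(source: str) -> list[int]:
    """Return sorted line numbers where a suspicious constant is an operand."""
    flagged = set()
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.BinOp):
            continue
        for operand in (node.left, node.right):
            if (
                isinstance(operand, ast.Constant)
                and isinstance(operand.value, int)
                and operand.value in SUSPICIOUS_CONSTANTS
            ):
                flagged.add(operand.lineno)
    return sorted(flagged)
```

The check only points a reviewer at the line. Deciding whether the arithmetic should become a call into a date library is still a human judgment.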

Most organizations already have their first invariants. They just haven’t recognized them as such. Every outage that traced back to someone bypassing the wrapper, every regression caused by a layering violation, every incident that led a senior engineer to leave the same review comment for the tenth time: those are your invariants, already discovered and already painful. Don’t ask your senior engineers which architectural rules matter. Ask them which review comments they leave because something actually broke. The answers will be higher-signal, and they’ll come faster — because outages have a way of making implicit knowledge suddenly very explicit.

Pick two or three. Choose the simplest framework that can enforce them in CI. Run it for a month. What you learn — what’s harder to specify than you expected, what turns out not to matter, how agents respond to the guardrail — is worth more than any amount of upfront planning.

But two or three is a starting point, not the program. The goal is a growing body of enforced architectural knowledge — one that expands every time a senior engineer hits the same review comment again and decides to convert their frustration into a check. That loop is worth understanding: the reviewer who encodes an invariant is not doing extra work. They are buying back their own time. Every rule that moves into the linter is a violation they will never have to catch manually again — from a human, and especially not from an agent generating changes at volume. The reward for contributing to the system is a lighter review queue and higher-quality code before it ever reaches them.

The organizations that will run agents well at scale are the ones that gave their senior engineers a place to turn repeated architectural pain into enforcement.

Read the full paper →

The Pipe Was Always Narrow

Roger Fleig — Fri, 03 Apr 2026 03:46:27 GMT

Around 2018, my team at Google built a system called Sensenmann that automatically deleted dead code at scale. It tracked binary usage across production and corporate desktops, and if a binary hadn’t been seen for months, the system generated a changelist to remove it root and twig. Phil Norman, one of my engineers, wrote about it. It submitted over a thousand deletion changelists a week and eventually deleted nearly 5% of all C++ at Google.

The technical system worked beautifully. The human system broke.

Developers started pushing back — and not because the deletions were wrong. They pushed back because even a correct, trivially simple change has a real cost when you’re the one being asked to approve it. You have to build enough context to be confident you won’t regret it later. You might be new to the project and not feel qualified to make that call. You might be heads-down shipping something for a launch and simply not have the bandwidth — no matter how correct the change is. Sometimes people just said no because they didn’t have the attention to think it through.

We ended up building something more sophisticated than a throttle. Teams could configure how many robot-authored changes they received and when. If an owner said “not now,” the system snoozed for a quarter and tried again — maybe the next owner would be ready to let it go.

That was before anyone had heard of an LLM. The changes were trivially simple deletions — no judgment required, correctness guaranteed. And it still broke the pipe.

This is the fifth in a series on The Narrow Pipe, and it’s about where the constraint bites hardest: not code quality, but the human capacity to absorb change.

The Same Mechanism, Larger Scale

Fast forward to early 2026, and the same dynamic is visible across major open-source projects.

GitHub’s Ashley Wolf calls it the “Eternal September of open source”: the cost to create has dropped, but the cost to review has not. The Augment Code data is the starkest version of this: across enterprise repositories, PR volume surged 98% and review time climbed 91% in step. All that extra generation velocity landed directly on reviewers.

Projects started closing the door. curl ended its bug bounty after a flood of AI-generated slop reports, and it was removing the incentive, not improving the tooling, that stopped the flood. LLVM’s policy explicitly calls unreviewed LLM output “extractive”: it shifts effort from the implementor to the reviewer.

The mechanism is not specific to open source. OSS got hit first because it has no access controls, and its volunteer reviewers have no “hire more” option. But I’ve watched the same thing play out inside an enterprise.

At Crusoe, my team had a rollout plan to terraform our GitLab instance across roughly 400 repositories. Before we could execute it, someone sent an AI-generated merge request covering the whole scope — unreadable, without a spec, just a prompt. To evaluate it, a reviewer would have had to reconstruct the entire context from scratch. That’s validation burden: it doesn’t appear in the PR count, and it doesn’t go away just because the code might be correct.

The second incident was subtler. Someone used an AI coding tool over a weekend to add a cache key I was deliberately holding back until after a larger refactor. The change looked plausible. Another engineer — one who wasn’t close to the full problem — approved it. Two messy weeks followed. The code wasn’t obviously wrong. The reviewer just didn’t have the context to know what they were actually approving, or what responsibility they were taking on.

The Terraform MR might have been fine. The deletion CLs were provably correct. In both cases someone still had to take responsibility for the change, and responsibility can’t be parallelized as cheaply as code generation can.

Engineering the Pipe

If the bottleneck is structural, the response has to be structural. All the approaches that have worked start from the same premise: reviewer attention is a finite resource. Engineer around it.

Narrow the scope of what you automate on the review side. Around the same time as Sensenmann, my team, in collaboration with a research group at Google Brain, built a system called AutoCommenter to handle a class of review tasks called “tips”: published preferences from the style guide, the low-hanging fruit of readability review. The training data came from the readability reviewers themselves: their own comments became the examples the model learned from, and then the system took over that specific class of work. It reached tens of thousands of developers daily. Manushree Vijayvergiya and my team published the results at AIware ’24.

It worked because the scope was narrow: style-guide compliance, not architectural judgment. The same pattern is re-emerging as verifier agents and defensive AI. Reviewer-side AI becomes much less reliable when it tries to replace the judgment that makes human review expensive in the first place.

Build backpressure deliberately. You get one chance to waste a developer’s time — after that, they stop paying attention. If you don’t build backpressure into the system, humans create their own: by rejecting everything, or by rubber-stamping it. Sensenmann’s queueing system gave teams control over their own intake rate. curl learned the same lesson in 2026: removing the bounty incentive did more to reduce the flood than any tooling change. Design the queue, or the queue designs itself — badly.
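
The intake control described here can be sketched as a small data structure. This is a minimal illustration under stated assumptions, not a description of how Sensenmann was built: the TeamIntake class, the weekly limit, and the 90-day stand-in for "a quarter" are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

SNOOZE_PERIOD = timedelta(days=90)  # assumed stand-in for "a quarter"


@dataclass
class TeamIntake:
    """Per-team limits on robot-authored changes (hypothetical shape)."""
    weekly_limit: int
    sent_this_week: int = 0
    snoozed_until: dict[str, date] = field(default_factory=dict)

    def can_send(self, change_id: str, today: date) -> bool:
        """A change may go out if the team has budget and it is not snoozed."""
        if self.sent_this_week >= self.weekly_limit:
            return False
        wake = self.snoozed_until.get(change_id)
        return wake is None or today >= wake

    def record_sent(self) -> None:
        """Count one change against this week's budget."""
        self.sent_this_week += 1

    def snooze(self, change_id: str, today: date) -> None:
        """An owner said "not now": hold this change for one snooze period."""
        self.snoozed_until[change_id] = today + SNOOZE_PERIOD
```

Keeping the limit and the snooze in the team's own configuration is the design choice that matters: the people absorbing the changes set the rate, not the system generating them.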

Shift validation before the review. Stripe’s Minions system runs local lint and a selected subset of its more than three million tests before anything reaches a human, with a hard cap of two CI rounds. By the time a PR reaches a person, most of the correctness validation has already happened. The review becomes a judgment call about design alignment, not a forensic investigation.

Match review depth to risk. Reviewing everything at equal depth means reviewing nothing well. AutoCommenter freed readability reviewers for the harder judgment calls by handling the mechanical ones. Graduated autonomy applies the same principle explicitly: route low-risk changes through lighter review, and preserve human attention for the changes that genuinely need it.
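
Routing by risk can start as a very small function. The sketch below is illustrative only: the Change fields, the tier names, and the thresholds are assumptions picked for the example, and a real policy would be derived from a team's own incident history.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """The few facts about a change that this toy policy looks at."""
    files_touched: int
    touches_shared_code: bool
    machine_verified: bool  # lint and tests already passed before review


def review_tier(change: Change) -> str:
    """Map a change to a review depth; thresholds are assumed values."""
    if change.touches_shared_code or change.files_touched > 20:
        return "full-review"
    if change.machine_verified and change.files_touched <= 3:
        return "light-review"
    return "standard-review"
```

The high-risk test comes first on purpose, so a small, machine-verified change to shared code still gets full review.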

Sensenmann’s deletion CLs were about as simple as a code change can be. Correct. Reviewable in seconds. They still broke the pipe — because correctness doesn’t reduce the cost of building enough context to feel confident taking responsibility for something. LLMs didn’t create that constraint. They widened the upstream side of it. Today’s AI-authored changes are larger in scope and arrive with more confidence. The reconstruction cost scales with them.

Addy Osmani asks what I think is the right question: how much of what we’re shipping do we actually understand? The pipe was always narrow. We just used to push less through it.

Wayfinding, Not Roadmaps

Roger Fleig — Fri, 27 Mar 2026 21:47:33 GMT

This is the fourth in a series on The Narrow Pipe, a position paper on the emerging infrastructure challenges of agent-scale software engineering.

There’s a natural instinct when engineering leaders encounter agentic development: build a roadmap. Read about Stripe’s Minions — a thousand agent-generated PRs a week — or hear Jensen Huang talk about 100 agents per engineer, and the next step feels obvious. Multi-quarter plan. Milestones, timelines, success criteria, projected ROI.

It feels like leadership. And it will often be wrong — not because the destination is wrong, but because the path cannot be known in advance.

What makes agentic infrastructure a genuinely complex problem — not merely a complicated one — is the mechanism underneath. Once code generation becomes cheap, the bottleneck shifts downstream into review, testing, coordination, and governance. But you usually don’t know in advance which of those constraints will dominate in your environment. In one organization it’s flaky tests. In another it’s context injection. In a mature monorepo it may be blast radius and review fatigue. The binding constraint varies by codebase, team structure, and organizational maturity — and it often isn’t what you’d predict.

That’s why the right unit of progress is not roadmap completion. It’s validated learning.

Complicated vs. Complex

One of the few corporate leadership courses that really stuck with me was on leading in complexity. It drew on Dave Snowden’s Cynefin framework and the work of Jennifer Garvey Berger and Keith Johnston. While writing The Narrow Pipe, I realized I had been applying that lesson almost implicitly. Experimentation kept surfacing in the paper not because it sounded good, but because when you’re operating in a genuinely complex domain, detailed prediction and linear planning don’t work. You make small, safe-to-fail bets, watch the system closely, and adapt based on what reality tells you.

A complicated problem has high predictability. Cause and effect are knowable. You can plan, execute, and arrive. You follow a map.

A complex problem has low predictability. Multiple variables interact in ways that are only clear in hindsight. The landscape shifts while you’re traversing it. A map is useless — not because you’re bad at cartography, but because the terrain won’t hold still.

In complex environments, you wayfind. Like Polynesian navigators reading currents, wind, and stars, you orient by signals rather than coordinates. You run small, safe-to-fail experiments. You amplify what works, dampen what doesn’t. You navigate by learning, not by planning.

Agentic infrastructure is firmly in the complex domain right now. The capabilities your team builds around are changing from quarter to quarter — not because anyone is doing anything wrong, but because the underlying models and tools haven’t stabilized.

What This Looks Like in Practice

Let me make this concrete. Say your team is rolling out autonomous agents and they’re iterating in CI loops until tests pass — a common agent workflow. The agents seem to be working, but tasks are taking longer than expected and compute costs are climbing.

Is the problem model quality? Context injection? Task decomposition? You don’t know yet. So you instrument.

You discover that your test suite has a 7% flake rate. That sounds tolerable; it was tolerable when humans ran tests a few times a day. But agents run tests in multi-stage loops, and flake probability compounds across stages. At a 7% flake rate across 10 stages, the probability of a clean run is 0.93^10, or about 48%. More than half of your agent workflows are hitting false failures, triggering retries, burning compute, and sometimes making unnecessary code changes to “fix” tests that weren’t actually broken.
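
The compounding is easy to check for your own numbers. The snippet below is a small sketch that assumes each stage flakes independently with the same probability, which real pipelines only approximate. The function name is made up for the example.

```python
def clean_run_probability(flake_rate: float, stages: int) -> float:
    """Chance that every stage passes, assuming independent flakes."""
    return (1.0 - flake_rate) ** stages


# The scenario from the text: 7% flake rate, 10 stages.
p = clean_run_probability(0.07, 10)
print(f"{p:.3f}")  # prints 0.484
```

Under the same assumption, lowering the flake rate to 1% raises the clean-run probability for 10 stages to about 90%, which is why flake reduction can matter more than model quality in a loop like this.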

No roadmap would have told you that. You had to instrument the pipeline, measure the actual failure modes, and discover that flake rate — a problem you’d lived with for years — had become an agent-halting condition. The constraint moved, and you found it by looking, not by planning.

That’s wayfinding.

Experimentation as Navigation

In my paper, I propose a structured experimental agenda for agentic infrastructure, and I arrived at it through the developer productivity measurement tradition — DORA, SPACE, DevEx — rather than complexity theory. But the convergence with wayfinding is striking:

Safe-to-fail probes. Each experiment is scoped to produce knowledge regardless of outcome. Instrument the pipeline for agent cost visibility. Compare agent behavior with and without flake detection. Test different context injection strategies. A negative result narrows the search space.

Signal reading over plan tracking. Rework Rate by source, autonomy duration, quality-adjusted throughput — these are instruments for reading the currents. When leading indicators (fast merges, high PR volume) diverge from lagging indicators (rising regressions, declining code health), that divergence is the signal. Only live instrumentation can surface it.

Multiple perspectives. Complexity cannot be seen from a single vantage point. DORA measures delivery performance. SPACE measures developer experience. DevEx focuses on feedback loops and cognitive load. No single metric tells the story. The Activity Trap is what happens when you collapse a complex system into a single number.

Autonomy Is Not Maturity

There’s a point from the paper that fits here and that I think is frequently overlooked: increasing what the AI does (autonomy) without proportionally improving the ability to verify what it did (controls) and to manage permissions, audit trails, and accountability (governance) creates risk, not progress.

This reinforces the case for experimentation. You’re not just exploring because the technology is changing fast. You’re exploring because the organization has to discover the right coupling between autonomy, verification, and governance in its specific environment. That coupling is different for every codebase, every team, every risk profile. It can’t be copied from a case study. It has to be measured.

Planning at the Right Level

I want to be precise about what I’m arguing against. It’s not planning — engineering leaders still need budgets, staffing, capacity planning, and sequencing. It’s planning at the wrong level of precision. It’s committing to a specific solution set before you’ve instrumented the problem. It’s treating “deploy agents across the org” as a roadmap item rather than a learning agenda.

The wayfinding posture means having a clear compass — agentic speed should translate into business outcomes without eroding quality — while accepting that the path will be discovered through disciplined experimentation rather than predicted through planning. You plan capacity and guardrails. You do not roadmap your way to truth.

The developer productivity discipline already has useful instincts for this: instrument before you optimize, measure outcomes rather than visible activity, and check whether your proxy metrics still track real value. The agentic era changes the variables, but not the need for disciplined observation.

The organizations that navigate this well won’t be the ones with the cleanest roadmap. They’ll be the ones that learned to read the water.

Read the full paper →

The Parallelism Thesis

Roger Fleig — Thu, 26 Mar 2026 19:38:36 GMT

This is the third in a series on The Narrow Pipe, a position paper on the emerging infrastructure challenges of agent-scale software engineering.

I keep having the same conversation with engineering leaders about agentic development, and it keeps stalling in the same place. They see AI coding tools as a way to make individual developers faster. That’s true, but it’s the smallest version of what’s happening.

The real unlock isn’t speed. It’s concurrency.

Here’s the simplest framework I’ve found for explaining where we are and where this is going. It maps roughly to the autonomy spectrum I described in my first post — but where that framework asks who is driving, this one asks how many lanes are open.

Tier 1: Co-pilot

You write code. The AI helps. It autocompletes, generates functions from comments, pairs with you on hard problems. Copilot, Cursor, and Claude Code in attended mode live here.

The gains are real. For certain tasks — boilerplate, test generation, unfamiliar APIs — it’s meaningfully faster. But the throughput ceiling is you. Every line of output still flows through your hands. You’re one developer, maybe 2–3x more productive on good days, on the right tasks.

It’s a faster single thread.

This is where the vast majority of developers are today. And it feels like the ceiling, because the speed improvement is tangible and immediate.

It is not the ceiling.

Tier 2: Manager of Agents

The role flips. You stop writing code and start defining tasks, decomposing work, and reviewing what comes back. The AI writes; you review. You’ve shifted from senior engineer to engineering manager.

But here’s the part most people miss, and it’s the whole game: the moment you make that shift, there’s no reason to manage just one agent.

At Tier 1, you and the AI share a single thread of execution. You’re in the loop for every decision. The AI is fast, but you’re the bottleneck — you can only attend to one task at a time.

At Tier 2, you can have five, eight, ten agents running concurrently on different tasks. Each one is in its own inner loop — reasoning, writing code, running tests, iterating — with very little need to surface for human input. You’re not copiloting anymore. You’re reviewing completed work as it arrives, while other agents continue making progress in the background.

This is the difference between a 2–3x speedup and a fundamentally different throughput curve. One fast agent on one task is incremental. Eight agents on eight tasks simultaneously is multiplicative — and you haven’t gotten faster at any individual task. You’ve gotten parallel.
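The gap between a faster single thread and several open lanes can be shown with a toy simulation. This is a sketch under obvious assumptions: run_agent is a hypothetical stand-in that only sleeps, and real agents would contend for CI, review, and merge capacity in ways this ignores.

```python
import asyncio
import time


async def run_agent(task: str, seconds: float) -> str:
    """Stand-in for an agent's inner loop; the sleep simulates the work."""
    await asyncio.sleep(seconds)
    return f"{task}: ready for review"


async def run_serially(tasks: list[str], seconds: float) -> list[str]:
    # Tier 1 shape: one thread of attention, one task at a time.
    results = []
    for task in tasks:
        results.append(await run_agent(task, seconds))
    return results


async def run_concurrently(tasks: list[str], seconds: float) -> list[str]:
    # Tier 2 shape: every task is in flight at once; no task is faster.
    return list(await asyncio.gather(*(run_agent(t, seconds) for t in tasks)))


def timed(coro):
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


tasks = [f"task-{i}" for i in range(8)]
serial_results, serial_time = timed(run_serially(tasks, 0.02))
parallel_results, parallel_time = timed(run_concurrently(tasks, 0.02))
print(f"serial: {serial_time:.2f}s, concurrent: {parallel_time:.2f}s")
```

Each simulated task takes the same time in both runs. Only the number of tasks in flight changes.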

The Tier 1 Trap

Tier 1 feels productive because the feedback loop is tight. You ask, the AI responds, you iterate together. It’s interactive and satisfying. The gains show up immediately.

Tier 2 requires a different posture. You have to get comfortable with agents running autonomously — making decisions you haven’t reviewed yet, taking approaches you might not have chosen. The feedback loop is longer. The control is looser. For many experienced engineers, that loss of direct control feels like a step backward, even when the output volume is dramatically higher.

I think this is the real barrier. It’s not technical. It’s psychological. The engineers who are best at their craft — the ones with the deepest instincts about code quality, architecture, and taste — are exactly the ones who find it hardest to let go of the keyboard. The skill that made them great (deep attention to every line) is the skill that prevents them from accessing the parallelism that makes Tier 2 transformative.

This is also, incidentally, what the METR study found: developers with the deepest familiarity with their codebases experienced the largest slowdowns when using AI tools in attended mode. The more you know, the more you correct — and if you’re correcting one agent in real time, you’re back to a single thread.

The Infrastructure Gap

There’s a second barrier, and this one is technical. Running one agent is easy. Running many agents concurrently starts surfacing problems that don’t exist at Tier 1: agents stepping on each other’s work, CI queues backing up, merge conflicts from parallel changes, review backlogs that exceed what any human can meaningfully process.

The downstream pipeline (builds, tests, review, integration) was designed for human-speed code production. When you multiply the input by eight, those systems become the bottleneck. This is the narrow pipe problem: the constraint was once producing code. Now it’s everything after producing code.

The uncomfortable implication: the parallelism that makes Tier 2 valuable is the same parallelism that breaks the pipeline. You can’t have one without solving the other.

The ROI Math

The economic case for Tier 2 over Tier 1 is straightforward once you see it:

Tier 1 ROI = (speed gain on individual tasks) × (one task at a time)

Tier 2 ROI = (agent task completion rate) × (number of concurrent agents) − (review cost + coordination overhead + infrastructure cost)

Tier 1 is a linear improvement. Tier 2 is a scaling function. Even if each individual agent is somewhat less effective than you would be doing the task yourself — and in many cases it will be — the aggregate output across many parallel agents can far exceed what a single developer can produce, at any speed.

The catch is the subtracted terms. Review cost grows with agent count. Coordination overhead grows nonlinearly. Infrastructure cost is real. If your downstream systems can’t absorb the volume, the gains evaporate. But these are engineering problems with known solution patterns — not fundamental limits.
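The two formulas can be turned into a small worked example. Every number below is an assumption chosen for illustration. So is the choice to express all terms in tasks per day and to model coordination overhead as growing with the number of agent pairs; the text above says only that it grows nonlinearly.

```python
def tier1_output(human_rate: float, speedup: float) -> float:
    """Tasks per day for one developer working a single thread."""
    return human_rate * speedup


def tier2_output(
    agent_rate: float,
    agents: int,
    review_cost: float,
    coordination_coeff: float,
    infra_cost: float,
) -> float:
    """Net tasks per day: gross agent output minus the subtracted terms.

    Assumptions: review cost is linear in agent count, and coordination
    overhead scales with the number of agent pairs, n * (n - 1) / 2.
    """
    gross = agent_rate * agents
    review = review_cost * agents
    coordination = coordination_coeff * agents * (agents - 1) / 2
    return gross - review - coordination - infra_cost


# Hypothetical inputs: a developer at 2 tasks/day with a 2.5x speedup,
# versus agents that each complete 1.5 tasks/day.
print("tier 1:", tier1_output(human_rate=2.0, speedup=2.5))
for n in (1, 8, 24, 40):
    net = tier2_output(1.5, n, review_cost=0.3, coordination_coeff=0.05, infra_cost=0.5)
    print(f"tier 2 with {n} agents: {net:.1f}")
```

With these made-up inputs, net output rises with agent count, flattens, and then falls once coordination dominates, which is the shape the subtracted terms predict.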

Tier 3: Spec-Driven Development

You define what to build and the acceptance criteria. Agents handle the how end-to-end. Hours later, you check results against specs and tests. The human role is product thinking and verification, not implementation.

This is operational today for certain classes of work — greenfield features with clear specs, well-bounded refactors, tasks where acceptance criteria can be fully automated. StrongDM’s “Attractor” system operates here: specifications go in as markdown, agents write the code, other agents test it against holdout scenarios the coding agents never saw.

It’s premature for most work on mature, complex codebases where institutional knowledge and implicit invariants make fully autonomous operation risky. But it’s where the trajectory points — and the infrastructure you build for Tier 2 (agent identity, coordination, blast radius awareness, test reliability, context injection) is what makes Tier 3 possible when the models and the guardrails are ready.

The Investment Case

This framing clarifies where infrastructure investment pays off.

You don’t need agent identity, coordination systems, or blast radius awareness for one co-pilot. You don’t even need most of it for one autonomous agent. But you absolutely need it the moment you’re running ten agents concurrently on a shared codebase. The infrastructure investment is what converts the theoretical parallelism of Tier 2 into realized throughput.

Organizations that stay at Tier 1 will see steady, incremental productivity gains. Organizations that make the jump to Tier 2 — and build the infrastructure to support it — access a different curve entirely.

The ceiling isn’t how fast one agent can code. It’s how many agents you can run in parallel before the downstream systems break. Raise that ceiling, and you change the economics of the entire engineering organization.


The $250,000 Vanity Metric

Roger Fleig — Wed, 25 Mar 2026 18:13:10 GMT

This is the second in a series on The Narrow Pipe, a position paper on the emerging infrastructure challenges of agent-scale software engineering.

Last week at GTC 2026, Jensen Huang laid out his vision for token budgets as engineer compensation — half a base salary, on top of pay, so engineers can be “amplified 10x.” Then on the All-In Podcast, he made it vivid: if a $500,000 engineer only spent $5,000 on AI tokens in a year, “I will go ape.” His benchmark? At least $250,000 in annual token consumption. When asked if Nvidia is spending $2 billion a year on tokens for its engineering team, he answered: “We’re trying to.”

This landed the same week as a New York Times piece that introduced the term “tokenmaxxing” — a status competition among engineers to consume the most AI tokens. Internal leaderboards tracking token consumption. Managers factoring raw AI usage into performance reviews. Individual engineers burning billions of tokens per week running swarms of parallel agents around the clock.

Jensen is the CEO of the company that sells the compute. Of course he wants engineers burning through a quarter million dollars in tokens. But when that framing trickles down into how engineering organizations evaluate productivity, it becomes something more dangerous than a sales pitch. It becomes a measurement system. And it’s measuring the wrong thing.

The Activity Trap

The most important measurement anti-pattern in agentic development has a name: the Activity Trap. It means measuring activity instead of outcomes — counting what’s easy to count rather than what actually matters. Jensen’s $250K benchmark is the Activity Trap in its purest and most expensive form: it measures input — resources burned — not output, and certainly not outcomes.

Agents make this trap catastrophically worse because they can generate enormous volumes of activity. An agent that opens twenty PRs in a day looks productive by activity metrics. If fifteen require multiple review rounds, three introduce regressions, and two get reverted, the net outcome may be negative. Activity metrics for agents aren’t just uninformative. They are actively misleading, because they create incentives to optimize for volume over value.

Lines of code generated. PR counts. Token consumption. Prompt counts. These aren’t productivity metrics. They’re input metrics. They tell you nothing about whether the output is correct, maintainable, or even used.

The 40-Point Perception Gap

Here’s what makes this genuinely dangerous: developers can’t tell either.

A 2025 randomized controlled trial by METR (Model Evaluation & Threat Research) produced the most rigorous empirical test of AI-assisted development productivity to date. Sixteen experienced open-source developers completed 246 real tasks on their own repositories — mature codebases averaging over a million lines that the developers had worked on for an average of five years. Tasks were randomly assigned to allow or disallow AI tools.

The result: developers using AI tools were 19% slower at completing tasks. Not faster. Slower.

But here’s the finding that should keep you up at night: before the study, developers predicted AI would make them 24% faster. After the study, having experienced the actual slowdown, they still estimated they had been 20% faster. That’s a gap of roughly 40 percentage points between perceived and actual productivity.

They weren’t just wrong about the magnitude. They were wrong about the direction.

Why It Happened

METR’s analysis identified causes that map directly to the infrastructure problems I describe in the full paper:

Context familiarity penalty. Developers who knew their codebases most deeply were slowed down most. The more an expert knows that the AI doesn’t have access to, the more time they spend correcting confident mistakes. This is the institutional knowledge problem — the gap between what lives in a developer’s head and what the agent can see.

Quality standards as friction. Code that is “functionally correct” but fails implicit quality standards requires substantial human cleanup. This is the rework problem.

Attention fragmentation. AI tools introduced micro-interruption patterns that disrupted flow state. The tool that was supposed to reduce cognitive load increased it.

What to Measure Instead

The established measurement frameworks — DORA, SPACE, DevEx — provide the scaffolding, but each needs adaptation for agents. Here’s the thinking tool I use:

Net agent value ≈ throughput gain − review burden − CI/compute cost − regression/rework cost − coordination overhead

This isn’t a formula to compute precisely. It’s a frame that makes the cost structure visible and prevents the common mistake of counting only the throughput gain while ignoring the costs subtracted from it.
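Even as a thinking tool, it helps to write the frame down as a ledger. The sketch below is one hypothetical way to do that. The AgentLedger name, the choice of engineer-hours per week as a common unit, and the sample figures are assumptions, not measurements.

```python
from dataclasses import dataclass


@dataclass
class AgentLedger:
    """All fields in the same unit; engineer-hours per week is assumed here."""
    throughput_gain: float
    review_burden: float
    ci_compute_cost: float
    rework_cost: float
    coordination_overhead: float

    def total_cost(self) -> float:
        return (
            self.review_burden
            + self.ci_compute_cost
            + self.rework_cost
            + self.coordination_overhead
        )

    def net_value(self) -> float:
        return self.throughput_gain - self.total_cost()


# Hypothetical week: a large visible gain, fully consumed by downstream costs.
week = AgentLedger(
    throughput_gain=120,
    review_burden=45,
    ci_compute_cost=20,
    rework_cost=50,
    coordination_overhead=15,
)
print(week.total_cost(), week.net_value())
```

The point of the structure is that the gain cannot be reported without the four cost fields sitting next to it.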

A few metrics that actually matter:

Rework Rate (DORA’s fifth metric, benchmarked 2025) — the ratio of deployments that are unplanned responses to production incidents. Track it by source: agent vs. human, by agent type, by code area. If agent-generated code drives up rework rate, the speed gain is illusory.

Autonomy Duration — how far an agent progresses through meaningful work before requiring human intervention. The primary measure of macro-loop efficiency.

Quality-adjusted throughput — throughput weighted by change failure rate, rework rate, and code health. An agent that produces five low-quality PRs requiring multiple review rounds is less valuable than one that produces two clean PRs that merge on first review.

Marginal value decay — as agent count increases, does each additional agent produce proportional value? Or do coordination costs, CI contention, and review bottlenecks produce diminishing returns? Plot value per agent against agent count and look for the inflection point.
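Two of these metrics are simple enough to sketch directly. The code below is a minimal illustration, assuming deployments are already labeled with a source and a rework flag, and that total value has been measured at a few agent counts. The function names and all sample data are hypothetical.

```python
from collections import Counter


def rework_rate_by_source(deployments: list[tuple[str, bool]]) -> dict[str, float]:
    """Share of each source's deployments that were flagged as rework.

    `deployments` holds (source, is_rework) pairs.
    """
    totals: Counter = Counter()
    reworks: Counter = Counter()
    for source, is_rework in deployments:
        totals[source] += 1
        if is_rework:
            reworks[source] += 1
    return {source: reworks[source] / totals[source] for source in totals}


def marginal_values(total_value: dict[int, float]) -> dict[int, float]:
    """Value added per extra agent between consecutive measured agent counts."""
    counts = sorted(total_value)
    return {
        hi: (total_value[hi] - total_value[lo]) / (hi - lo)
        for lo, hi in zip(counts, counts[1:])
    }


def first_count_below(marginals: dict[int, float], floor: float):
    """Smallest measured agent count whose marginal value drops below `floor`."""
    for count in sorted(marginals):
        if marginals[count] < floor:
            return count
    return None


# Hypothetical data, for illustration only.
deployments = (
    [("agent", True)] * 3 + [("agent", False)] * 7
    + [("human", True)] * 1 + [("human", False)] * 9
)
measured = {1: 10.0, 2: 19.0, 4: 34.0, 8: 50.0, 16: 54.0}

print(rework_rate_by_source(deployments))
print(marginal_values(measured))
print(first_count_below(marginal_values(measured), floor=1.0))
```

In this made-up series, each agent added between 8 and 16 contributes half a unit of value, so the useful ceiling sits somewhere below 16 until the downstream constraints are addressed.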

Metrics That Actively Mislead

“Compiles and passes tests” is just as dangerous as lines-of-code. In the full paper, I include a case study of an LLM-generated reimplementation of SQLite — 576,000 lines of Rust. It compiled. It passed its tests. A basic primary key lookup on 100 rows took 1,815 milliseconds. The same operation in SQLite takes 0.09 milliseconds. It was 20,171x slower — because it was missing a single physical storage optimization that someone profiled against a real workload decades ago.

That code would have scored perfectly on every activity metric and every “does it work” gate. Only outcome-level measurement — a performance benchmark — revealed the gap.
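An outcome-level gate of this kind does not need to be elaborate. The sketch below uses Python's built-in sqlite3 module as the system under test and checks a primary key lookup on 100 rows against a latency budget. The 5 ms budget, the table layout, and the function names are assumptions for illustration. This is not the benchmark from the case study, and the same harness shape would be pointed at whichever implementation is being evaluated.

```python
import sqlite3
import statistics
import time


def median_pk_lookup_ms(rows: int = 100, repeats: int = 200) -> float:
    """Median latency, in milliseconds, of a primary key lookup."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany(
        "INSERT INTO items (id, payload) VALUES (?, ?)",
        [(i, f"row-{i}") for i in range(rows)],
    )
    samples = []
    for i in range(repeats):
        key = i % rows
        start = time.perf_counter()
        row = conn.execute(
            "SELECT payload FROM items WHERE id = ?", (key,)
        ).fetchone()
        samples.append((time.perf_counter() - start) * 1000.0)
        # Correctness still matters: the lookup must return the right row.
        assert row == (f"row-{key}",)
    conn.close()
    return statistics.median(samples)


def within_budget(latency_ms: float, budget_ms: float) -> bool:
    return latency_ms <= budget_ms


BUDGET_MS = 5.0  # hypothetical budget; a real one comes from the workload

latency = median_pk_lookup_ms()
print(f"median lookup: {latency:.4f} ms, within budget: {within_budget(latency, BUDGET_MS)}")
```

A lookup that takes 1,815 milliseconds passes the correctness assertion inside the loop and fails the budget check, which is exactly the distinction between a "does it work" gate and an outcome gate.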

The Unsolved Problem

There’s one critical dimension that no current framework measures well: ownership attribution. Who understands a given piece of code well enough to debug and maintain it? In a world where agents produce code and humans review it under volume pressure, the assumption that “the reviewer owns it” becomes increasingly fragile.

If no human wrote the code and the reviewer only skimmed it, who actually understands the system well enough to debug it when it breaks at 3 AM?

This is an open problem. But the first step is acknowledging that the metrics most organizations are using today — the ones that feel like measurement — aren’t measuring what matters.

Without instrumented, outcome-based measurement, organizations will deploy agents, feel faster, and never discover they are slower — until rework rate and regression data make the cost undeniable.

The full paper goes deeper on the measurement framework — adapted versions of DORA, SPACE, and DevEx for agentic workflows, and a detailed experimental agenda for testing what actually works.


The Narrow Pipe

Roger Fleig — Wed, 25 Mar 2026 05:45:47 GMT

I published a position paper today called The Narrow Pipe — it’s long, opinionated, and covers a lot of ground. This post is the short version of the thesis, and the first in a series that unpacks the key ideas.

Here’s the core observation: the cost of producing code is falling fast. But the downstream infrastructure that reviews, builds, tests, integrates, and deploys that code? It was designed for human-speed throughput.

This is the Theory of Constraints applied to software engineering. For decades, the constraint was producing code. Now that agents are dissolving that constraint, the bottleneck doesn’t disappear — it shifts downstream, and every existing limitation in the pipeline becomes the new critical path.

The Agent Loop

To reason about this concretely, it helps to define two loops:

The micro-loop is the agent’s internal cycle: reason, act, observe, repeat. Each iteration burns tokens, time, and potentially infrastructure — a test run, a build, a search query. Every problem in agentic infrastructure either inflates or deflates this loop count. A flaky test that sends the agent chasing a phantom failure? That’s wasted micro-loops. Poor context injection that omits a key dependency? More wasted micro-loops.

The macro-loop is the roundtrip between agent and human — a review comment, an escalation, a course correction. This is where human judgment enters the system. The key metric here is autonomy duration: how far an agent can progress through meaningful work before it needs a human. Longer autonomy duration means fewer macro-loops and less human bottleneck. But longer autonomy without guardrails means compounding errors.

The Narrow Pipe, restated: agents squeeze the pipeline from both directions simultaneously. The micro-loop hammers shared infrastructure at 10–100x the rate it was designed for. The macro-loop stays human-speed but now faces a volume of agent-produced work that no review process was built to absorb.
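The two loops can be read off an agent's event log. The sketch below assumes a deliberately simplified log with only two event types, "step" for one micro-loop iteration and "human" for one macro-loop roundtrip, and it measures autonomy duration in micro-loop steps. Both the log format and that unit are assumptions. Wall-clock time or completed subtasks would be equally valid choices.

```python
from statistics import mean


def autonomy_durations(events: list[str]) -> list[int]:
    """Count micro-loop steps between consecutive human interventions.

    Two back-to-back "human" events record a run of length 0.
    """
    runs = []
    current = 0
    for event in events:
        if event == "step":
            current += 1
        elif event == "human":
            runs.append(current)
            current = 0
        else:
            raise ValueError(f"unknown event: {event!r}")
    if current:
        runs.append(current)  # trailing run with no intervention yet
    return runs


# Hypothetical session: three stretches of autonomous work, two interventions.
events = ["step"] * 5 + ["human"] + ["step"] * 12 + ["human"] + ["step"] * 3
runs = autonomy_durations(events)
macro_loops = events.count("human")
print(runs, macro_loops, round(mean(runs), 1))
```

Longer runs mean fewer macro-loops for the same amount of work. Whether those longer runs are productive or just compounding errors is a separate question that this count alone cannot answer.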

The Autonomy Spectrum

Dan Shapiro mapped the progression of AI-assisted development onto five levels — modeled on the NHTSA’s driving automation levels. The analogy is intentional: each level shifts who is driving.

Level 1 — Autocomplete. AI suggests completions. Human is driving.
Level 2 — Pair programming. Developer and AI collaborate. This is where most “AI-native” teams operate today. Critically, every level from here on feels like the ceiling. It is not.
Level 3 — Code review. AI writes the code; the developer reviews it. The developer’s job shifts from senior engineer to engineering manager. This is where the Narrow Pipe thesis bites hardest.
Level 4 — Spec-driven. Engineers write specifications. Agents write code. Other agents test it. Hours later, humans check the results. The job shifts from how to what.
Level 5 — The Dark Factory. Specs go in, software comes out.

My paper focuses on building the infrastructure for Levels 3 and 4 to work reliably at scale — particularly in mature monorepos, where the structural challenges make every level transition harder.

Why “More Code” Isn’t the Goal

There’s a subtler risk beyond throughput. Without deliberate reinforcement, the pipeline doesn’t merely constrict the flow of agent-generated code — it allows institutional quality to leak out over time. Review rigor erodes under volume pressure. Agent output becomes the context for future agents. Patterns simplify. Critical details — the kind that only accumulate through years of profiling against real workloads — quietly disappear.

The goal is not “more code.” It’s ensuring that agentic speed translates into business outcomes without eroding the architectural stability that makes a codebase maintainable over years.

What’s in the Full Paper

The Narrow Pipe maps eight interdependent problem areas (runtime isolation, agent identity, graduated autonomy, coordination, context injection, institutional knowledge, test reliability, and blast radius awareness), proposes a measurement framework adapted from DORA, SPACE, and DevEx, and lays out an experimental agenda for making progress with evidence rather than intuition.

It also includes a detailed case study of Stripe’s Minions system — the most concrete public example of enterprise-scale agentic development — and a cautionary case study of an LLM-generated SQLite reimplementation that compiled, passed its tests, and was 20,000x slower on a basic operation.

The infrastructure that governs code quality must be rebuilt for a world where the volume of changes exceeds what human judgment alone can govern. The paper is my attempt to map that problem space rigorously.

More posts in this series coming soon — next up: why your agent productivity metrics are probably lying to you.
