Estimation in an era of exploding variance
The more advanced a control system is, the more crucial may be the contribution of the human operator.
Two scenes have coexisted in this industry for a long time. In one, developers are buried in overtime. In the other, every other role complains that developers are the only ones with too much slack. They look like opposites, but they're the same problem wearing two faces. The labor-hour model can't handle variance.
AI made the variance worse. Plenty of devs now finish a day's work in an hour or two. Self-deprecating posts like "I feel bad for my company" show up regularly on anonymous forums like r/overemployed. (These companies are usually still pricing work like it's pre-AI.)
The labor-hour model is a relic
Salary, hourly rate, contractor quote, rate negotiation — every way we price labor reduces to one thing: turning work into time. Price = labor-hours × rate. Full-time employment bundles those hours into 8-hour blocks sold by the month; contracting splits the same units and sells them by the hour. Bundle or piece, the unit is still time.
In a factory, this works honestly enough. If one hour produces one part, the next hour produces about the same. Pricing by time becomes statistically reasonable. Time as a unit naturally fits low-variance work.
Software was never like that. One hour produces zero, the next produces a hundred, even for the same person on the same day. The labor-hour model can't deal with this variance directly, so it pegs price to the mean of the distribution and tacks on a 20–30% buffer. That pushes the quote a bit past the average, but there's no telling whether the tails stay inside the buffer.
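To see why a buffer on the mean doesn't tame a skewed distribution, here's a minimal sketch. The lognormal shape and its parameters are my own assumption for illustration, not anyone's measured data; the only point is that a fat right tail routinely escapes a mean-plus-25% quote.

```python
import random

# Toy model: task durations drawn from a right-skewed (lognormal) distribution.
# The parameters (mu=1.0, sigma=0.6) are invented for illustration, not measured.
random.seed(42)
durations = [random.lognormvariate(1.0, 0.6) for _ in range(10_000)]  # in days

mean = sum(durations) / len(durations)
quote = mean * 1.25  # the labor-hour move: price at the mean plus a 25% buffer

over = sum(d > quote for d in durations) / len(durations)
print(f"mean {mean:.1f}d, quote {quote:.1f}d, share of tasks past the quote: {over:.0%}")
```

In this toy run, roughly a quarter of tasks land past the quote, and some land past it by multiples of it.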
So where does the variance go? Someone absorbs it. Bargaining power decides who. When the worker is weak, it gets absorbed in overtime and lower quality. When the company is weak, it gets absorbed in slack schedules and waste. Either way, it's inefficiency. The two scenes from above are exactly that.
That said, calling it a pure flaw is too easy. The labor-hour estimate is also a lie both sides agree to. Quote a client honestly with "9–12 days, 80% confidence" and in practice negotiation anchors at 12. The moment you bake variance into price, the average estimate goes up. The point estimate "5 days" works as a polite fiction that lowers transaction costs. Strip it out and the deal stops moving. So this model needs patching, not scrapping.
Agile[1] and story points[2] were a step toward thinking in distributions. The idea was to measure by relative complexity instead of time. Using a sequence like 1, 2, 3, 5, 8[3] followed the same logic — wider gaps for larger units, on purpose, so you'd accept fuzziness instead of pretending to a precision you don't have. But in practice, points devolved into a time index ("1 point = half a day"), and "30 points this sprint" was just a point estimate again. Even Ron Jeffries, who arguably invented them, regrets it.[4] The shadow of the point estimate never went away. (One fintech I worked at had a Story Points field on every Jira ticket and nobody filled it in. Just figuring out the points was the hard part lol.)
Three currents in software estimation
(1) Drop the point estimate entirely: #NoEstimates
This camp is hardline.[5] "Don't estimate. Count." Use historical data to simulate the future and produce a distribution. The answer isn't "how many days." It's "for this kind of work, our team finishes in 9–12 days with 80% confidence."
This is the most coherent position. If point estimates on high-variance work are inherently lies, distributions are honest.
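For the flavor of it, here's a minimal sketch of the "count, don't estimate" move: resample historical cycle times and report an interval instead of a number. The history list and the task count are invented for illustration; a real team would pull both from its own tracker.

```python
import random

# Forecast a batch of similar tasks by resampling the team's past cycle times.
history_days = [0.5, 1, 1, 2, 2, 3, 3, 4, 5, 8]  # invented history for this kind of work
remaining_tasks = 4

random.seed(7)
totals = sorted(
    sum(random.choice(history_days) for _ in range(remaining_tasks))
    for _ in range(10_000)
)

# Report an 80% interval instead of a single number.
low = totals[int(0.10 * len(totals))]
high = totals[int(0.90 * len(totals))]
print(f"80% of simulated runs finish within {low:.0f}-{high:.0f} days")
```

Whether the answer comes out near "9–12 days" depends entirely on the history you feed it, which is the whole point: the distribution belongs to the team, not to the estimator.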
The trouble isn't just bargaining power. When a company quotes externally, the client wants to hear "5 days," not "9–12 days, 80%." Domains where the market itself forces point estimates — fixed-bid contracts, government tenders, marketing campaigns with hard deadlines — still account for a huge chunk of the work out there. And distributions become bottlenecks for downstream work. Dependent teams or plans can't sit on top of "9–12 days."
(2) Cut scope upfront in writing: Writing-First
Basecamp's Shape Up[6] and Amazon's Working Backwards / PR-FAQ[7] are the canonical examples — and they look pretty different. Shape Up fixes time in 6-week cycles with a betting table; if it doesn't ship in 6 weeks, you cut scope or push to the next cycle. PR-FAQ writes an imagined press release before a single line of code. But underneath, they do the same thing. They cut scope by writing, before code.
Isn't that just Agile? Half right. Shape Up calls itself "post-Agile" and sits in the Agile family. PR-FAQ, with its "polished doc before code," is closer to BDUF (Big Design Up Front) — exactly what Agile pushed back against. The real shared core has nothing to do with the Agile vs Waterfall axis. What they share is that scope itself gets cut by writing, before variance has a chance to grow. Story points tried to measure variance. Writing-First cuts it before it shows up.
Stripe's memo culture[8], Shopify's GSD[9], and Spotify's DIBBs[10] do the same thing at the level of decision-making. Cut by writing first, then execute once there's consensus.
(3) Catch it after the fact with data: DORA, METR, GitClear
The freshest current. People are starting to measure AI's impact quantitatively. These sources don't agree, and their conclusions shift every year. That's how fast things are moving.
The DORA 2025[11] report finds that with AI adoption hitting 90%, AI correlates positively with throughput and product performance, but still negatively with delivery stability. The faster code gets written, the more often systems break. DORA put it this way: AI doesn't fix a team, it amplifies what's already there. Strong teams get stronger, weak teams get more exposed.
METR's RCT from July 2025[12] is more provocative. They recruited 16 experienced open-source developers, had them work on real issues from their own repositories, randomly assigned each issue to allow or forbid AI tools, and measured actual completion time. Result: when AI was allowed, tasks took 19% longer. The participants themselves predicted AI would speed them up by 24%, and even after the experiment they estimated a 20% speedup. The gap between perception and measurement was almost 40 percentage points.
(I usually accept reputable studies like this, but as a working developer I can't quite swallow "they were slower." The sample is only 16, and the study used Sonnet 3.5/3.7 — real limitations.)
METR seems to have had similar doubts. They ran a follow-up[13] with 57 developers starting August 2025. Results dropped on February 24, 2026, and the conclusion was telling. The data can't be trusted. Developers held back exactly the tasks they wouldn't want to do without AI; 30–50% of participants admitted as much. One participant said it plainly: "I avoid issues where AI could finish in 2 hours but I'd need 20." A formal RCT collapsed under selection bias. METR would only say, qualitatively, that "AI is likely speeding things up by early 2026."
The GitClear 2025 report[14] analyzed 211M lines of code and found that code churn (code reverted within 2 weeks of being written) nearly doubled, from 3.1% in 2020 to 5.7% in 2024. Duplicated code blocks also went up 8x. Write fast, delete just as fast, paste the same thing in multiple places. Quantitative evidence of the cognitive gap.
The fact that these three sources don't point in one direction is itself a signal. Measurement is still hard, and the methodology keeps breaking. AI shifts task distributions, developers shift task selection, and since you do other work while agents run, the unit of work-time itself is coming loose. Just as the labor-hour model can't pin down variance, academic measurement is hitting the same wall.
What LLMs added to software estimation
LLMs gave one more push to the variance these three currents were already wrestling with.
The average dropped. The variance grew.
Average-case work — scaffolding, prototypes, entering a new domain — gets eaten fast by LLMs. An hour's job finishes in 7 minutes. Spikes[15] got cheaper, too.
But the tails stayed the same or got worse. Integration, verification, edge cases, changes pushed into a running system. As LLMs lowered the average, they seem to have widened the variance along with it. Hard to say definitively — DORA, METR, and GitClear all measure averages, not variance directly. The widening variance is more field intuition than measured fact. But the sense that work far from the average has gotten farther is sharper in practice.
How does this hit estimation? "3.25 days" is more of a lie than ever. The same 3-day job, with an LLM, finishes in half a day sometimes and eats a week other times. Point estimates lose more meaning.
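A toy comparison with made-up numbers, just to show what "the average dropped, the variance grew" does to a point estimate:

```python
import statistics

# Invented durations for the "same 3-day job", before and after LLMs:
# pre-LLM it reliably takes about 3 days; post-LLM it usually takes half a day
# but occasionally eats most of a week. These numbers are illustrative only.
pre_llm = [2.5, 3.0, 3.0, 3.5, 3.0, 2.5, 3.5, 3.0]
post_llm = [0.5, 0.5, 0.5, 1.0, 0.5, 7.0, 0.5, 5.0]

for label, sample in [("pre-LLM", pre_llm), ("post-LLM", post_llm)]:
    print(f"{label}: mean {statistics.mean(sample):.1f}d, "
          f"spread (stdev) {statistics.pstdev(sample):.1f}d")
```

In this made-up example the mean falls from 3.0 to about 1.9 days while the spread blows up, and the new mean describes almost none of the individual outcomes.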
The cognitive gap: Bainbridge wrote it in 1983
Lisanne Bainbridge, a British industrial psychologist, wrote a short 1983 paper in Automatica called Ironies of Automation[16]. She watched what happens to human operators when automation enters places like aviation and nuclear control rooms.
As automation increases, humans lose their grip on the system. And only the work that's hard to automate — the exceptional, the genuinely hard — is left for humans to do.
She named four ironies, and they map cleanly onto a codebase:
- The errors of designers — here, the people who made the LLM training data — eventually get caught by humans in the trenches. Ask an LLM for an estimate and it'll inherit the conservative pattern in its training data and give you a wildly padded estimate.
- The difficulty of the leftover work goes up. Once LLMs handle the easy stuff, what's left for developers is on average harder.
- Skill atrophy. If you don't use it, you lose it. Debugging becomes that.
- Out-of-the-loop performance problem. LLMs do everything most of the time, but the moment a system breaks, the human has no context.
LLMs don't make developer work easier. They make what's left harder.
So which approach wins in the LLM era
Of the three currents, which absorbs the LLM era best? Personally I'm pulled toward Writing-First. Specifically, a hybrid of upfront spec-out and cycle-internal spec-in.
#NoEstimates is the most coherent and honest answer in an exploding-variance era, but it's weak in external negotiation. The data-driven path could turn measurement itself into new infrastructure, but as we saw, the methodology keeps breaking. DORA, GitClear, METR are all measuring the same thing through different lenses, and none of them converge.
If measurement keeps breaking, the only move left is to cut by negotiation upfront. Writing-First works for a simple reason. Most of the world runs on the art of negotiation. Estimation is guessing, and how a guess gets accepted depends less on its intrinsic accuracy than on trust in the person or team making it. The dev who pads every estimate, the one who sets tight deadlines and actually hits them, the one who overestimates their own speed and misses every time. The same "5 days" lands differently depending on who said it. Cutting variance by negotiation upfront is more realistic than measuring it after, especially as variance explodes.
And spec-ing out is severely underrated as a discipline. "Cut by writing before code" used to get mocked as Waterfall, but it's the same move Amazon revived with PR-FAQ, Basecamp with Shape Up, Stripe with memo culture. It's the most realistic tool for cutting variance one step ahead. When LLMs make code cheap, the time spent figuring out what to build becomes more valuable. Code got cheap, so badly-cut scope hurts more.
What's interesting in the LLM era is that spec-in grows too. To be honest, I haven't heard this term used in industry. Specs only ever flowed outward, toward cutting. Code was expensive. Now that code is nearly free, more discoveries surface during implementation: "here we should go deeper, more precisely." What used to be processed under negative labels like "additional requirements" or "scope changes" might actually be the most valuable moments of discovery.
(I've worked at a Writing-First company myself. Small details would naturally get sorted at the Pull Request stage, but reviewers often judged that something "should have been confirmed at the PR-FAQ stage" and approval would stall for days.)
Once you accept that, pure Writing-First wobbles a bit. It's a kind of admission that variance can't all be cut upfront — variance keeps popping up throughout the cycle. So really it's closer to a hybrid of #NoEstimates and Writing-First. Cut upfront in writing, and inside the cycle, renegotiate as you go. Not one negotiation upfront, but negotiation throughout.
There are caveats. Shape Up's 6-week cycle might just be too long in an era where LLMs shortened spikes. 2 weeks? 1 week? Recalibration may be needed. And cutting by writing requires negotiation power. In organizations heavy with hierarchy or running on ad-hoc decisions over Slack and verbal agreements, writing one PR-FAQ can take a month. These orgs aren't short on negotiation power. They're short on the authority to cut by writing. The adoption cost isn't small.
Still, while #NoEstimates and the data-driven current are building new measurement infrastructure, Writing-First has a negotiation tool that already works and just needs to be tuned for the LLM era.
Closing
Since I started my career, I've wanted to be a developer who estimates well. I saw it as the fastest path to earning my tech lead's trust. That hasn't changed. What's shifted is the unit of my attention — from the individual to the team. As the era changes, I think this still matters. I'm just trying to do that as well as I can from where I stand.
1. A software development philosophy that started with the 2001 Agile Manifesto. Short cycles, responding to change, working software first. Scrum, Kanban, and XP are common practices. Agile Manifesto (2001)
2. A unit for estimating size/complexity in Agile/Scrum. Uses relative complexity instead of time. Typically Fibonacci (1, 2, 3, 5, 8, 13...). Story Points (Atlassian)
3. 1 kg vs 2 kg is distinguishable by hand, but 20 kg vs 21 kg isn't. Larger work has less precision in human cognition, so the sequence deliberately widens the gaps between units to accept fuzziness instead of false precision. Why The Fibonacci Sequence Works Well for Estimating
4. "I may have invented story points, and if I did, I'm sorry now." A piece where the inventor of the concept regrets how it got distorted. Story Points Revisited (2019)
5. Notable proponents include Allen Holub, Vasco Duarte, Daniel Vacanti, and Troy Magennis.
6. Basecamp's product development methodology. 6-week cycles + 2-week cooldown; fixed time, variable scope. "Shaping" (cutting upfront) comes before "building". Shape Up by Ryan Singer
7. Amazon's product planning method. Before any code, write an imagined launch press release (PR) and FAQ. Working backwards from the customer. Working Backwards by Werner Vogels
8. Stripe's decision-making culture of writing decisions first and having peers critique in writing. Memos build consensus before meetings.
9. Shopify's product development framework, "Get Shit Done". Stages: Proposal → Prototype → Build → Release → Iterate.
10. Spotify's strategic planning framework. Structures decisions as Data → Insights → Beliefs → Bets.
11. Short for DevOps Research and Assessment. A research group started by Nicole Forsgren et al. that publishes the annual "State of DevOps" reports; hosted by Google Cloud. Authors of the book "Accelerate". DORA 2025 Report
12. METR (Model Evaluation & Threat Research) is a non-profit for AI safety/capability evaluation. The study was an RCT (randomized controlled trial) in which 16 experienced open-source developers' real issues were randomly assigned to AI-allowed and AI-forbidden conditions. Measuring Early-2025 AI Productivity (METR)
13. A follow-up experiment with a larger sample to re-validate the first study's reliability. Concluded the data couldn't be trusted due to selection bias. Uplift Update (METR)
14. GitClear is a company that builds code change analysis tools. They publish an annual AI code quality report. AI Copilot Code Quality 2025 (GitClear)
15. A term from Kent Beck's XP. A short, time-boxed exploration of areas where estimation itself is hard. "Like driving a nail straight into a log — narrow but deep." Used in this post loosely as "a short exploration to reduce uncertainty before main work."
16. Bainbridge's 1983 short paper that laid out the paradoxes of automation. Core claim: as automation grows, the difficulty of the residual human work goes up. Ironies of Automation (Wikipedia)