Estimation in an era of exploding variance
The more advanced a control system is, the more crucial may be the contribution of the human operator.
Two scenes have coexisted in this industry for a long time. In one, developers are buried in overtime. In the other, every other role complains that developers are the only ones with too much slack. They look like opposites, but they're the same problem wearing two faces. The labor-hour model can't handle variance.
AI made the variance worse. Plenty of devs now finish a day's work in an hour or two. Self-deprecating posts like "I feel bad for my company" show up regularly on anonymous forums like r/overemployed. (These companies are usually still pricing work like it's pre-AI.)
The labor-hour model is a relic
Salary, hourly rate, contractor quote, rate negotiation — every way we price labor reduces to one thing: turning work into time. Price = labor-hours × rate. Full-time employment bundles those hours into 8-hour blocks sold by the month; contracting splits the same units and sells them by the hour. Bundle or piece, the unit is still time.
In a factory, this works honestly enough. If one hour produces one part, the next hour produces about the same. Pricing by time becomes statistically reasonable. Time as a unit naturally fits low-variance work.
Software was never like that. One hour produces zero, the next produces a hundred, even for the same person on the same day. The labor-hour model can't deal with this variance directly, so it pegs price to the mean of the distribution and tacks on a 20–30% buffer. That pushes the quote a bit past the average, but there's no telling whether the tails stay inside the buffer.
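To see why a buffer on the mean doesn't tame a skewed distribution, here's a minimal sketch. The lognormal shape and its parameters are my own assumption for illustration, not anyone's measured data; the only point is that a fat right tail routinely escapes a mean-plus-25% quote.

```python
import random

# Toy model: task durations drawn from a right-skewed (lognormal) distribution.
# The parameters (mu=1.0, sigma=0.6) are invented for illustration, not measured.
random.seed(42)
durations = [random.lognormvariate(1.0, 0.6) for _ in range(10_000)]  # in days

mean = sum(durations) / len(durations)
quote = mean * 1.25  # the labor-hour move: price at the mean plus a 25% buffer

over = sum(d > quote for d in durations) / len(durations)
print(f"mean {mean:.1f}d, quote {quote:.1f}d, share of tasks past the quote: {over:.0%}")
```

In this toy run, roughly a quarter of tasks land past the quote, and some land past it by multiples of it.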
So where does the variance go? Someone absorbs it. Bargaining power decides who. When the worker is weak, it gets absorbed in overtime and lower quality. When the company is weak, it gets absorbed in slack schedules and waste. Either way, it's inefficiency. The two scenes from above are exactly that.
That said, calling it a pure flaw is too easy. The labor-hour estimate is also a lie both sides agree to. Quote a client honestly with "9–12 days, 80% confidence" and in practice negotiation anchors at 12. The moment you bake variance into price, the average estimate goes up. The point estimate "5 days" works as a polite fiction that lowers transaction costs. Strip it out and the deal stops moving. So this model needs patching, not scrapping.
Agile[1] and story points[2] were a step toward thinking in distributions. The idea was to measure by relative complexity instead of time. Using a sequence like 1, 2, 3, 5, 8[3] followed the same logic — wider gaps for larger units, on purpose, so you'd accept fuzziness instead of pretending to a precision you don't have. But in practice, points devolved into a time index ("1 point = half a day"), and "30 points this sprint" was just a point estimate again. Even Ron Jeffries, who arguably invented them, regrets it.[4] The shadow of the point estimate never went away. (One fintech I worked at had a Story Points field on every Jira ticket and nobody filled it in. Just figuring out the points was the hard part lol.)
Three currents in software estimation
(1) Drop the point estimate entirely: #NoEstimates
This camp is hardline.[5] "Don't estimate. Count." Use historical data to simulate the future and produce a distribution. The answer isn't "how many days." It's "for this kind of work, our team finishes in 9–12 days with 80% confidence."
This is the most coherent position. If point estimates on high-variance work are inherently lies, distributions are honest.
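For the flavor of it, here's a minimal sketch of the "count, don't estimate" move: resample historical cycle times and report an interval instead of a number. The history list and the task count are invented for illustration; a real team would pull both from its own tracker.

```python
import random

# Forecast a batch of similar tasks by resampling the team's past cycle times.
history_days = [0.5, 1, 1, 2, 2, 3, 3, 4, 5, 8]  # invented history for this kind of work
remaining_tasks = 4

random.seed(7)
totals = sorted(
    sum(random.choice(history_days) for _ in range(remaining_tasks))
    for _ in range(10_000)
)

# Report an 80% interval instead of a single number.
low = totals[int(0.10 * len(totals))]
high = totals[int(0.90 * len(totals))]
print(f"80% of simulated runs finish within {low:.0f}-{high:.0f} days")
```

Whether the answer comes out near "9–12 days" depends entirely on the history you feed it, which is the whole point: the distribution belongs to the team, not to the estimator.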
The trouble isn't just bargaining power. When a company quotes externally, the client wants to hear "5 days," not "9–12 days, 80%." Domains where the market itself forces point estimates — fixed-bid contracts, government tenders, marketing campaigns with hard deadlines — still account for a huge chunk of the work out there. And distributions become bottlenecks for downstream work. Dependent teams or plans can't sit on top of "9–12 days."
(2) Cut scope upfront in writing: Writing-First
Basecamp's Shape Up[6] and Amazon's Working Backwards / PR-FAQ[7] are the canonical examples — and they look pretty different. Shape Up fixes time in 6-week cycles with a betting table; if it doesn't ship in 6 weeks, you cut scope or push to the next cycle. PR-FAQ writes an imagined press release before a single line of code. But underneath, they do the same thing. They cut scope by writing, before code.
Isn't that just Agile? Half right. Shape Up calls itself "post-Agile" and sits in the Agile family. PR-FAQ, with its "polished doc before code," is closer to BDUF (Big Design Up Front) — exactly what Agile pushed back against. The real shared core has nothing to do with the Agile vs Waterfall axis. What they share is that scope itself gets cut by writing, before variance has a chance to grow. Story points tried to measure variance. Writing-First cuts it before it shows up.
Stripe's memo culture[8], Shopify's GSD[9], and Spotify's DIBBs[10] do the same thing at the level of decision-making. Cut by writing first, then execute once there's consensus.
(3) Catch it after the fact with data: DORA, METR, GitClear
The freshest current. People are starting to measure AI's impact quantitatively. These sources don't agree, and their conclusions shift every year. That's how fast things are moving.
The DORA 2025[11] report finds that with AI adoption hitting 90%, AI correlates positively with throughput and product performance, but still negatively with delivery stability. The faster code gets written, the more often systems break. DORA put it this way: AI doesn't fix a team, it amplifies what's already there. Strong teams get stronger, weak teams get more exposed.
METR's RCT from July 2025[12] is more provocative. They recruited 16 experienced open-source developers, had them work on real issues from their own repositories, randomly assigned each issue to allow or forbid AI tools, and measured actual completion time. Result: when AI was allowed, tasks took 19% longer. The participants themselves predicted AI would speed them up by 24%, and even after the experiment they estimated a 20% speedup. The gap between perception and measurement was almost 40 percentage points.
(I usually accept reputable studies like this, but as a working developer I can't quite swallow "they were slower." The sample is only 16, and the study used Sonnet 3.5/3.7 — real limitations.)
METR seems to have had similar doubts. They ran a follow-up[13] with 57 developers starting August 2025. Results dropped on February 24, 2026, and the conclusion was telling. The data can't be trusted. Developers held back exactly the tasks they wouldn't want to do without AI; 30–50% of participants admitted as much. One participant said it plainly: "I avoid issues where AI could finish in 2 hours but I'd need 20." A formal RCT collapsed under selection bias. METR would only say, qualitatively, that "AI is likely speeding things up by early 2026."
The GitClear 2025 report[14] analyzed 211M lines of code and found that code churn (code reverted within 2 weeks of being written) nearly doubled, from 3.1% in 2020 to 5.7% in 2024. Duplicated code blocks also went up 8x. Write fast, delete just as fast, paste the same thing in multiple places. Quantitative evidence of the cognitive gap.
The fact that these three sources don't point in one direction is itself a signal. Measurement is still hard, and the methodology keeps breaking. AI shifts task distributions, developers shift task selection, and since you do other work while agents run, the unit of work-time itself is coming loose. Just as the labor-hour model can't pin down variance, academic measurement is hitting the same wall.
What LLMs added to software estimation
LLMs gave one more push to the variance these three currents were already wrestling with.
The average dropped. The variance grew.
Average-case work — scaffolding, prototypes, entering a new domain — gets eaten fast by LLMs. An hour's job finishes in 7 minutes. Spikes[15] got cheaper, too.
But the tails stayed the same or got worse. Integration, verification, edge cases, changes pushed into a running system. As LLMs lowered the average, they seem to have widened the variance along with it. Hard to say definitively — DORA, METR, and GitClear all measure averages, not variance directly. The widening variance is more field intuition than measured fact. But the sense that work far from the average has gotten farther is sharper in practice.
How does this hit estimation? "3.25 days" is more of a lie than ever. The same 3-day job, with an LLM, finishes in half a day sometimes and eats a week other times. Point estimates lose more meaning.
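A toy comparison with made-up numbers, just to show what "the average dropped, the variance grew" does to a point estimate:

```python
import statistics

# Invented durations for the "same 3-day job", before and after LLMs:
# pre-LLM it reliably takes about 3 days; post-LLM it usually takes half a day
# but occasionally eats most of a week. These numbers are illustrative only.
pre_llm = [2.5, 3.0, 3.0, 3.5, 3.0, 2.5, 3.5, 3.0]
post_llm = [0.5, 0.5, 0.5, 1.0, 0.5, 7.0, 0.5, 5.0]

for label, sample in [("pre-LLM", pre_llm), ("post-LLM", post_llm)]:
    print(f"{label}: mean {statistics.mean(sample):.1f}d, "
          f"spread (stdev) {statistics.pstdev(sample):.1f}d")
```

In this made-up example the mean falls from 3.0 to about 1.9 days while the spread blows up, and the new mean describes almost none of the individual outcomes.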
The cognitive gap: Bainbridge wrote it in 1983
Lisanne Bainbridge, a British industrial psychologist, wrote a short 1983 paper in Automatica called Ironies of Automation[16]. She watched what happens to human operators when automation enters places like aviation and nuclear control rooms.
As automation increases, humans lose their grip on the system. And only the work that's hard to automate — the exceptional, the genuinely hard — is left for humans to do.
She named four ironies, and they map cleanly onto a codebase:
- The errors of designers — here, the people who made the LLM training data — eventually get caught by humans in the trenches. Ask an LLM for an estimate and it'll inherit the conservative pattern in its training data and give you a wildly padded estimate.
- The difficulty of the leftover work goes up. Once LLMs handle the easy stuff, what's left for developers is on average harder.
- Skill atrophy. If you don't use it, you lose it. Debugging becomes that.
- Out-of-the-loop performance problem. LLMs do everything most of the time, but the moment a system breaks, the human has no context.
LLMs don't make developer work easier. They make what's left harder.
So which approach wins in the LLM era
Of the three currents, which absorbs the LLM era best? Personally I'm pulled toward Writing-First. Specifically, a hybrid of upfront spec-out and cycle-internal spec-in.
#NoEstimates is the most coherent and honest answer in an exploding-variance era, but it's weak in external negotiation. The data-driven path could turn measurement itself into new infrastructure, but as we saw, the methodology keeps breaking. DORA, GitClear, METR are all measuring the same thing through different lenses, and none of them converge.
If measurement keeps breaking, the only move left is to cut by negotiation upfront. Writing-First works for a simple reason. Most of the world runs on the art of negotiation. Estimation is guessing, and how a guess gets accepted depends less on its intrinsic accuracy than on trust in the person or team making it. The dev who pads every estimate, the one who sets tight deadlines and actually hits them, the one who overestimates their own speed and misses every time. The same "5 days" lands differently depending on who said it. Cutting variance by negotiation upfront is more realistic than measuring it after, especially as variance explodes.
And spec-ing out is severely underrated as a discipline. "Cut by writing before code" used to get mocked as Waterfall, but it's the same move Amazon revived with PR-FAQ, Basecamp with Shape Up, Stripe with memo culture. It's the most realistic tool for cutting variance one step ahead. When LLMs make code cheap, the time spent figuring out what to build becomes more valuable. Code got cheap, so badly-cut scope hurts more.
What's interesting in the LLM era is that spec-in grows too. To be honest, I haven't heard this term used in industry. Specs only ever flowed outward, toward cutting. Code was expensive. Now that code is nearly free, more discoveries surface during implementation: "here we should go deeper, more precisely." What used to be processed under negative labels like "additional requirements" or "scope changes" might actually be the most valuable moments of discovery.
(I've worked at a Writing-First company myself. Small details would naturally get sorted at the Pull Request stage, but reviewers often judged that something "should have been confirmed at the PR-FAQ stage" and approval would stall for days.)
Once you accept that, pure Writing-First wobbles a bit. It's a kind of admission that variance can't all be cut upfront — variance keeps popping up throughout the cycle. So really it's closer to a hybrid of #NoEstimates and Writing-First. Cut upfront in writing, and inside the cycle, renegotiate as you go. Not one negotiation upfront, but negotiation throughout.
There are caveats. Shape Up's 6-week cycle might just be too long in an era where LLMs shortened spikes. 2 weeks? 1 week? Recalibration may be needed. And cutting by writing requires negotiation power. In organizations heavy with hierarchy or running on ad-hoc decisions over Slack and verbal agreements, writing one PR-FAQ can take a month. These orgs aren't short on negotiation power. They're short on the authority to cut by writing. The adoption cost isn't small.
Still, while #NoEstimates and the data-driven current are building new measurement infrastructure, Writing-First has a negotiation tool that already works and just needs to be tuned for the LLM era.
Closing
Since I started my career, I've wanted to be a developer who estimates well. I saw it as the fastest path to earning my tech lead's trust. That hasn't changed. What's shifted is the unit of my attention — from the individual to the team. As the era changes, I think this still matters. I'm just trying to do that as well as I can from where I stand.
1. A software development philosophy that started with the 2001 Agile Manifesto. Short cycles, responding to change, working software first. Scrum, Kanban, and XP are common practices. Agile Manifesto (2001)
2. A unit for estimating size/complexity in Agile/Scrum. Uses relative complexity instead of time. Typically Fibonacci (1, 2, 3, 5, 8, 13...). Story Points (Atlassian)
3. 1 kg vs 2 kg is distinguishable by hand, but 20 kg vs 21 kg isn't. Larger work has less precision in human cognition, so the sequence deliberately widens the gaps between units to accept fuzziness instead of false precision. Why The Fibonacci Sequence Works Well for Estimating
4. "I may have invented story points, and if I did, I'm sorry now." A piece where the inventor of the concept regrets how it got distorted. Story Points Revisited (2019)
5. Notable proponents include Allen Holub, Vasco Duarte, Daniel Vacanti, and Troy Magennis.
6. Basecamp's product development methodology. 6-week cycles + 2-week cooldown; fixed time, variable scope. "Shaping" (cutting upfront) comes before "building". Shape Up by Ryan Singer
7. Amazon's product planning method. Before any code, write an imagined launch press release (PR) and FAQ. Working backwards from the customer. Working Backwards by Werner Vogels
8. Stripe's decision-making culture of writing decisions first and having peers critique in writing. Memos build consensus before meetings.
9. Shopify's product development framework, "Get Shit Done". Stages: Proposal → Prototype → Build → Release → Iterate.
10. Spotify's strategic planning framework. Structures decisions as Data → Insights → Beliefs → Bets.
11. Short for DevOps Research and Assessment. A research group started by Nicole Forsgren et al. that publishes the annual "State of DevOps" reports; hosted by Google Cloud. Authors of the book "Accelerate". DORA 2025 Report
12. METR (Model Evaluation & Threat Research) is a non-profit for AI safety/capability evaluation. The study was an RCT (randomized controlled trial) in which 16 experienced open-source developers' real issues were randomly assigned to AI-allowed and AI-forbidden conditions. Measuring Early-2025 AI Productivity (METR)
13. A follow-up experiment with a larger sample to re-validate the first study's reliability. Concluded the data couldn't be trusted due to selection bias. Uplift Update (METR)
14. GitClear is a company that builds code change analysis tools. They publish an annual AI code quality report. AI Copilot Code Quality 2025 (GitClear)
15. A term from Kent Beck's XP. A short, time-boxed exploration of areas where estimation itself is hard. "Like driving a nail straight into a log — narrow but deep." Used in this post loosely as "a short exploration to reduce uncertainty before main work."
16. Bainbridge's 1983 short paper that laid out the paradoxes of automation. Core claim: as automation grows, the difficulty of the residual human work goes up. Ironies of Automation (Wikipedia)