Most Developer Productivity Metrics Measure the Wrong Thing
My code coverage is 100%. Granted, most of the tests just assert that true === true, but the dashboard looks fantastic.
There's a particular flavour of corporate brain damage that emerges when someone in leadership discovers you can put numbers on a dashboard and those numbers can go up. The moment that happens, the numbers become the point. The actual thing the numbers were supposed to represent quietly exits through the back door.
Developer productivity measurement is where this phenomenon goes to thrive.
The Metrics Industrial Complex
DORA metrics — deployment frequency, lead time, change failure rate, time to restore — are probably the least bad framework we have. Nicole Forsgren's research across 31,000+ professionals found that elite performers deploy 208 times more frequently than low performers, with 106 times faster lead times. The correlation with organisational performance (profitability, market share) is real and well-documented.
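To make the four metrics concrete, here is a minimal sketch of how they might be computed from a team's deploy log. The record schema (`committed`, `deployed`, `failed`, `restore_minutes`) is hypothetical, invented for illustration; real pipelines would pull these timestamps from CI and incident tooling.

```python
from datetime import datetime
from statistics import median

# Hypothetical deploy records: when the change was first committed, when it
# reached production, and whether it caused a failure needing recovery.
deploys = [
    {"committed": datetime(2024, 3, 1, 9), "deployed": datetime(2024, 3, 1, 14),
     "failed": False, "restore_minutes": 0},
    {"committed": datetime(2024, 3, 2, 10), "deployed": datetime(2024, 3, 3, 10),
     "failed": True, "restore_minutes": 45},
    {"committed": datetime(2024, 3, 4, 8), "deployed": datetime(2024, 3, 4, 9),
     "failed": False, "restore_minutes": 0},
]

def dora_summary(deploys, window_days=7):
    # Deployment frequency: deploys per day over the observation window.
    frequency = len(deploys) / window_days
    # Lead time for changes: median commit-to-production duration, in hours.
    lead_time_h = median(
        (d["deployed"] - d["committed"]).total_seconds() / 3600 for d in deploys
    )
    # Change failure rate: share of deploys that caused a failure.
    failures = [d for d in deploys if d["failed"]]
    cfr = len(failures) / len(deploys)
    # Time to restore: median recovery time across failed deploys.
    mttr = median(d["restore_minutes"] for d in failures) if failures else 0.0
    return {"deploys_per_day": frequency, "median_lead_time_h": lead_time_h,
            "change_failure_rate": cfr, "median_restore_min": mttr}

print(dora_summary(deploys))
```

Note that all four are properties of the delivery system, not of any individual — which is exactly how the research intends them to be read.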
But here's what gets lost: DORA metrics were designed as diagnostics, not targets. The moment you stick them on a team's performance review, you get deployment frequency gaming — no-op deploys, feature flags with no features behind them, splitting releases into meaninglessly small increments. The dashboard looks elite. The engineering organisation hasn't changed.
Charles Goodhart made the point in 1975: when a measure becomes a target, it ceases to be a good measure. (That pithy phrasing is actually Marilyn Strathern's later distillation, but the idea is Goodhart's.) Software engineering has been speedrunning the principle ever since.
Lines of Code: The Metric That Won't Die
You'd think we'd have learned from IBM in the 1980s. Their Federal Systems Division measured programmer productivity in lines of code per day. The team building NASA's Space Shuttle avionics software was producing roughly 10 lines per day per developer. By LoC standards, this looked terrible. The code had a defect rate of 0.1 errors per 1,000 lines — some of the most reliable software ever written.
Bill Atkinson at Apple once submitted a weekly report showing -2,000 lines of code. He'd refactored a graphics routine to be faster and shorter. Under LoC metrics, that was his worst week. It was probably his most valuable.
Bill Gates reportedly said measuring software productivity by lines of code is like measuring progress on an aeroplane by how much it weighs. And yet, organisations still track this. In 2026. With straight faces.
Velocity Is a Planning Tool, Not a Score
Story points were invented by Ron Jeffries as a relative sizing mechanism for sprint planning. He's since said, and I'm paraphrasing gently, that he regrets it. His actual words: "I may have invented story points, and if I did, I'm sorry now."
The problem isn't story points themselves. It's what happens when management discovers them. Pivotal Labs found that teams whose velocity was tracked by management showed velocity increases of 30-50% over six months with no corresponding increase in features delivered. Teams just re-estimated stories at higher point values. The metric went up. Nothing else changed.
Martin Fowler put it plainly: the moment management uses velocity to compare teams or set targets, the metric becomes useless. And comparing velocity across teams is statistically meaningless anyway — a "5" on one team bears no relation to a "5" on another.
Cycle Time: Better, but Not Enough
Cycle time — the duration from starting work to deploying it — is borrowed from lean manufacturing and it's genuinely useful as a flow metric. Don Reinertsen made a compelling case that shorter cycle times reduce cost of delay and enable faster learning. LinearB published data showing teams with PR cycle times under 2 days shipped 3.5x more features per quarter than those over 7 days.
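As a sketch of cycle time as a flow metric: given start-of-work and deployed timestamps per change (the PR names and dates below are hypothetical), comparing the median against the slowest change shows how a single complex piece of work — the architectural refactoring case discussed next — sits far above the typical flow.

```python
from datetime import datetime
from statistics import median

# Hypothetical changes: (name, work started, deployed to production).
prs = [
    ("feat/search", datetime(2024, 5, 1, 9), datetime(2024, 5, 2, 16)),
    ("fix/timeout", datetime(2024, 5, 3, 11), datetime(2024, 5, 3, 15)),
    ("refactor/db", datetime(2024, 5, 6, 10), datetime(2024, 5, 14, 10)),
]

def cycle_times_days(prs):
    # Cycle time per change: start-of-work to deployed, in days.
    return [(deployed - started).total_seconds() / 86400
            for _, started, deployed in prs]

times = cycle_times_days(prs)
print(f"median cycle time: {median(times):.2f} days")  # typical flow
print(f"longest cycle time: {max(times):.2f} days")    # the refactoring
```

The median is the more honest summary here; an average would let one long-running refactoring drag the whole team's number, inviting exactly the wrong pressure.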
Cycle time is better than most metrics. But it has the same vulnerability: it measures speed without direction. You can have brilliant cycle time while shipping the wrong things. It penalises complex work — a team doing a hard architectural refactoring will have longer cycle times than one cranking out CRUD endpoints. The refactoring might be orders of magnitude more valuable.
And when cycle time becomes a target, people start skipping code reviews and cutting testing to merge faster. You've optimised the metric and degraded the system.
What Actually Predicts Effectiveness
Google spent two years studying 180+ internal teams for Project Aristotle. The number one predictor of team effectiveness wasn't individual talent, seniority, team size, or any engineering metric. It was psychological safety — the belief that you won't be punished for making a mistake or speaking up.
Forsgren's research found that generative cultures (high cooperation, shared responsibility, failure leading to inquiry rather than blame) were 2.4x more likely to produce high-performing teams. The 2023 DevEx research from Microsoft showed that developer satisfaction and self-rated productivity correlate more strongly with business outcomes than any activity-based metric. Developers who reported high satisfaction were 47% more likely to report high productivity.
The pattern is consistent across decades of research: the things that actually predict engineering effectiveness are cultural and systemic. Psychological safety. Developer experience. Cognitive load. Flow state. Feedback loop speed. None of these fit neatly on a dashboard, which is precisely why they get ignored in favour of commit counts.
McKinsey Tried Anyway
In August 2023, McKinsey published "Yes, you can measure software developer productivity." The engineering community's response was immediate and unified in a way that rarely happens.
Kent Beck warned he'd seen this film before — every attempt to measure individual developer output has ended badly. Gergely Orosz noted it was clearly written by people who'd never managed a software team. Charity Majors, CTO of Honeycomb, was more direct: "Every time a management consultancy tries to measure developer productivity, an engineering org loses its soul."
The core issue: McKinsey was attempting to commoditise engineering labour into interchangeable units with measurable output. This is the exact framing that decades of research — from Brooks's The Mythical Man-Month in 1975 to SPACE in 2021 — has been pushing against.
The Uncomfortable Truth
The reason bad metrics persist isn't that better alternatives don't exist. It's that the better alternatives are harder to sell to a board. "We measure psychological safety through quarterly surveys and track developer experience across flow state, feedback loops, and cognitive load" doesn't have the same ring as "our deployment frequency is up 40%."
If you're measuring developer productivity, measure the system, not the individuals. Use DORA as a diagnostic, not a target. Pay attention to developer satisfaction — it's a leading indicator, not a feel-good vanity metric. And when someone shows you a dashboard where all the numbers are going up, ask what's not on the dashboard.
Sam Newman said it best: "The moment you tie a metric to someone's performance review, you are no longer measuring what you think you're measuring. You are measuring people's ability to optimise for the metric."
The best engineering teams don't have the best metrics. They have the fewest reasons to game them.