The Metric That Ate the System
Why the systems we build to prevent failure become the systems that guarantee it — and why the fastest organizations have the fewest failures
The onboarding took three and a half months.
Nobody thought that was a problem. The product was a mid-market SaaS platform with several hundred enterprise accounts, but the implementation looked more like a consulting engagement. Hundreds of configuration fields. Custom JSON editing. Bespoke HTML templates for every client. A sales engineer could hand a customer practically anything they asked for — a UI layout that matched their brand guidelines, a workflow that mirrored their internal process, a data model shaped to their specific edge case. Customers loved it. They got exactly what they wanted. Each one felt like the product had been built for them alone.
The flexibility was the pitch. It was also the thing that was quietly converting a software business into a professional services firm. Every bespoke configuration added hours to onboarding. Every custom template required an engineer who understood that specific customer’s setup. Renewals required annual configuration updates that consumed weeks of the customer’s time and weeks of the team’s time, each update tested against a setup so customized that no two deployments looked alike. The product could do anything. Getting it to do the specific thing a customer needed took longer every year.
Our team built the replacement. Opinionated where the old system was open-ended. Fewer configuration fields. Sensible defaults. A product that behaved like a product. The tradeoff was obvious: some customers would lose the exact UI tweak they’d requested two years ago. Some workflows would standardize where they’d previously bent. The new system could onboard a customer in weeks instead of months.
The legacy teams refused to touch it. They had existing customers on annual update cycles that already required weeks of careful work. Switching those customers to the new platform meant some customization wouldn’t carry over. A button in the wrong shade. A dashboard panel in a different order. The teams were measured on retention. They could see the risk of a customer calling to complain about a missing feature. They could not see the cost of a customer losing weeks of their year to a configuration update that existed only because the system was too bespoke to update efficiently. One risk had a number. The other was invisible.
The measurement system was working exactly as designed. It just couldn’t see what it was protecting, or what that protection was costing.
On December 9, 2021, a critical vulnerability in Apache Log4j became public. Within hours, it was clear the library was everywhere — buried in applications, tucked inside vendor products, woven into systems that hadn’t been inventoried in years. Severity: 10 out of 10. Exploitation was trivial. Attackers were already scanning.
The organizations with mature compliance programs knew their patch cycle targets. Thirty days for critical vulnerabilities. Documented policies, trained staff, audited controls. Their compliance dashboards could show the percentage of systems scanned, findings remediated, training completion rates. Green across the board.
None of that answered the question Log4j actually asked: how fast can you move?
Finding every instance meant searching places the vulnerability scanner didn’t reach. Vendor-packaged software. Internal tools nobody owned anymore. Transitive dependencies three layers deep in the build chain. Then the harder part — testing, prioritizing, sequencing changes across interconnected systems, coordinating deployments that couldn’t be handled one at a time because the services talked to each other. This wasn’t a patching exercise. It was an organizational change-velocity problem. Staying secure meant managing the change curve across the entire estate, in days, under pressure. The organizations that moved fastest weren’t the ones with the best compliance scores. They were the ones that had exercised the muscle of making rapid, coordinated changes because something in their operating model had demanded it before the crisis.
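The filesystem half of that search is the easy part to picture. Here is a minimal sketch, assuming JAR-style packaging and a single host to walk; the filename heuristic and archive suffixes are illustrative choices, not an authoritative scanner:

```python
# A minimal sketch of the inventory problem: walk a directory tree, open every
# JAR/WAR/EAR as a zip archive, and flag log4j-core artifacts, including ones
# nested inside other archives, which is where most of the surprises lived.
# The filename heuristic and archive suffixes are illustrative assumptions.
import io
import re
import sys
import zipfile
from pathlib import Path

ARCHIVE_SUFFIXES = {".jar", ".war", ".ear", ".zip"}
LOG4J_CORE = re.compile(r"log4j-core-(\d+(?:\.\d+)*)\.jar$")


def scan_archive(data: bytes, label: str, findings: list[str]) -> None:
    """Search one archive, recursing into any archives packed inside it."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            for name in zf.namelist():
                if match := LOG4J_CORE.search(name):
                    findings.append(f"{label} -> {name} (log4j {match.group(1)})")
                if Path(name).suffix.lower() in ARCHIVE_SUFFIXES:
                    scan_archive(zf.read(name), f"{label} -> {name}", findings)
    except zipfile.BadZipFile:
        pass  # Not a readable zip archive; skip it.


def scan_tree(root: Path) -> list[str]:
    """Walk a filesystem subtree and report every log4j-core artifact found."""
    findings: list[str] = []
    for path in root.rglob("*"):
        if path.is_file() and path.suffix.lower() in ARCHIVE_SUFFIXES:
            if match := LOG4J_CORE.search(path.name):
                findings.append(f"{path} (log4j {match.group(1)})")
            scan_archive(path.read_bytes(), str(path), findings)
    return findings


if __name__ == "__main__":
    root = Path(sys.argv[1]) if len(sys.argv) > 1 else Path(".")
    for finding in scan_tree(root):
        print(finding)
```

The script is the small part. Running something like it across every host, vendor appliance, and build pipeline, and then sequencing the fixes it surfaces, is the change-velocity problem.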
The compliance framework had a metric for part of this. Security teams measure how quickly a known vulnerability gets remediated. SLA adherence on critical patches. Mean time to fix. That’s real, and it mattered during Log4j. But the framework asks technology teams to prioritize fixing this week’s bug. It does not ask them to build the organizational capacity for large, coordinated, systemic change. Remediating a single CVE is maintenance; sequencing an estate-wide response across interconnected systems in days is a fundamentally different capability — and the organizations that looked identical on a compliance dashboard performed wildly differently when the answer depended on that capability instead of control coverage.
In 2014, Nicole Forsgren, Jez Humble, and Gene Kim started measuring what nobody had measured across software delivery teams: flow and failure simultaneously. The DORA research surveyed thousands of professionals across industries over four years, tracking four metrics — deployment frequency, lead time for changes, change failure rate, and mean time to restore service.
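In operational terms, the four metrics fall out of ordinary delivery records. Here is a minimal sketch, assuming simplified in-memory deployment and incident records rather than DORA’s actual survey instruments; the record shapes are illustrative, not a standard schema:

```python
# A minimal sketch of the four DORA metrics computed from delivery records.
# Deploy and Incident are illustrative record shapes, not a standard schema;
# real pipelines would pull these from CI/CD and incident tooling, and the
# observation window is assumed to be non-empty.
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Deploy:
    committed_at: datetime   # when the change was committed
    deployed_at: datetime    # when it reached production
    caused_failure: bool     # did it degrade service or need remediation?


@dataclass
class Incident:
    started_at: datetime
    restored_at: datetime


def dora_metrics(deploys: list[Deploy], incidents: list[Incident], window_days: int) -> dict:
    return {
        # Deployment frequency: deploys per day over the observation window.
        "deploys_per_day": len(deploys) / window_days,
        # Lead time for changes: commit to production, averaged.
        "lead_time_hours": mean(
            (d.deployed_at - d.committed_at).total_seconds() / 3600 for d in deploys
        ),
        # Change failure rate: share of deploys that degraded service.
        "change_failure_rate": sum(d.caused_failure for d in deploys) / len(deploys),
        # Time to restore service: incident start to restoration, averaged.
        "time_to_restore_hours": mean(
            (i.restored_at - i.started_at).total_seconds() / 3600 for i in incidents
        ),
    }
```

Nothing in the computation is exotic. What the research did was collect these numbers across thousands of organizations and look at how they moved together.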
The conventional model assumed a tradeoff. Move fast or be stable. Ship frequently or ship safely. Mature organizations found their balance by slowing down. The assumption was so deeply embedded that most engineering leaders didn’t recognize it as an assumption.
Forsgren’s data demolished it. Across the 2014–2017 cohorts, elite performers deployed on demand — multiple times per day — with change failure rates under 15% and recovery times under an hour. Low performers deployed between once per month and once every six months, with failure rates several times higher and recovery times stretching into weeks or months. The teams that moved fastest didn’t just fail less often. They recovered faster when they did fail. Speed, stability, and recovery weren’t competing priorities — they were expressions of a single organizational capability: the capacity for controlled, coordinated, recoverable change.
The magnitude matters. Across years of DORA data, elite teams deployed up to 973 times more frequently than their lowest-performing counterparts, while maintaining lower failure rates and faster recovery. The 2019 State of DevOps Report found they were twice as likely to meet or exceed their organizational performance goals. Organizations that had slowed down to be safe hadn’t just become slow. They had become fragile in ways their own metrics couldn’t detect — because their metrics only measured the failure rate, not the capacity to respond.
None of this was new. Deming put the argument in writing in 1986, and had been teaching it to industry since the 1950s: quality comes from improving the process, not inspecting the output. You cannot improve a system by measuring only its failures. The insight is seventy years old. DORA proved it with data for software. The governance frameworks that shape how technology organizations operate have yet to catch up.
The dominant IT governance, service management, and compliance frameworks — NIST 800-53, ITIL, COBIT, SOC 2 — were built for technology organizations. They weren’t borrowed from another industry and misapplied. They were designed, specifically, to govern how technology teams operate. And every one of them structurally rewards change aversion. ITIL’s change advisory board is a gate. NIST 800-53’s configuration management controls verify that changes go through approval processes. SOC 2’s CC8.1 requires that every change be authorized, tested, approved, and documented. COBIT’s BAI06 demands traceability from request to go-live. 1,196 NIST controls across twenty families, and not one measures whether the organization retains the capacity to change at the speed its threat environment demands.
Forsgren’s research tested the most visible of these gates directly. Accelerate found that external change approvals — the kind ITIL’s CAB produces — were negatively correlated with lead time, deployment frequency, and restore time, and had no correlation with change fail rate. The gate slowed delivery without improving safety.
These frameworks didn’t just omit change velocity. They embedded mechanisms that slow change and then measured compliance with those mechanisms. An auditor checking CM-3 verifies that your change control process exists and functions. No auditor asks whether that process is producing the organizational rigidity that will prevent you from responding to the next Log4j in time. The frameworks shaped an entire industry’s behavior. The behavior they shaped produces brittleness. That’s not a gap in scope. It’s a design flaw in the governance model itself.
The risk register has a field for “probability of failure if we make this change.” It does not have a field for “probability of failure if we don’t.” So the locally rational decision, every quarter, is the same. Don’t move. Report green. And hope the forcing function — the Log4j, the vendor EOL, the customer who finally leaves — doesn’t arrive before next quarter’s review.
I didn’t fix the SaaS onboarding problem by convincing the legacy teams they were wrong. They weren’t wrong. Their customers did care about their configurations. The risk of a complaint during migration was real. Every objection was valid inside the frame they were operating in. Arguing with them was arguing with the metric.
So we went around it. New sales engineers, working new accounts, onboarded on the new platform from day one. No legacy configurations to protect. No annual update cycles to preserve. No measurement frame that could only see what might be lost. The new cohort onboarded customers in a fifth of the time. Retention climbed beyond anything the legacy book had produced.
Then the legacy customers migrated themselves. The annual configuration updates — the weeks-long marathons the old teams had been protecting — dropped to a single day on the new platform. Customers weren’t attached to their bespoke UI tweaks. They were exhausted by them. When the update dropped from weeks to a day, the customers the legacy teams had been afraid of losing became the loudest advocates for the new system.
The thing the team was protecting was never what the customer valued. The customer valued their time. Nobody had been measuring it.
DORA’s research demonstrated the same principle at scale. Teams tracking flow alongside failure outperformed the ones tracking failure alone. The evidence was built outside the frame that couldn’t see it. Then the frame changed, because the results were undeniable.
Time-since-last-systemic-change belongs on the risk register alongside vulnerability count. Organizational change velocity — how quickly this team can execute a coordinated, cross-system modification — belongs in the security review alongside control coverage. Mean time between major architectural changes is a leading indicator of brittleness, the same way deployment frequency is a leading indicator of delivery health.
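Producing those numbers is not the hard part. Here is a minimal sketch, with an illustrative change-log shape and an arbitrary 180-day threshold, of putting stasis on the same row of the register as the vulnerability count:

```python
# A minimal sketch of tracking organizational stasis next to vulnerability
# count. The change-log shape, the "systemic" flag, and the 180-day threshold
# are illustrative assumptions; the point is that the number exists at all.
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Change:
    completed_at: datetime
    systemic: bool  # touched multiple systems or crossed team boundaries


def stasis_metrics(changes: list[Change], now: datetime) -> dict:
    systemic = sorted(c.completed_at for c in changes if c.systemic)
    gaps = [(later - earlier).days for earlier, later in zip(systemic, systemic[1:])]
    return {
        "days_since_last_systemic_change": (now - systemic[-1]).days if systemic else None,
        "mean_days_between_systemic_changes": mean(gaps) if gaps else None,
    }


def risk_register_row(open_vulnerabilities: int, stasis: dict, threshold_days: int = 180) -> dict:
    # The row that carries the vulnerability count also carries the stasis
    # indicator, so "we have not moved in a year" is as visible as "we have
    # twelve open criticals."
    days_still = stasis["days_since_last_systemic_change"]
    return {
        "open_vulnerabilities": open_vulnerabilities,
        **stasis,
        "stasis_flag": days_still is not None and days_still > threshold_days,
    }
```

The particular threshold matters less than the existence of the field. Once the number sits on the register, standing still has a cost someone has to defend.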
There’s a question that will tell you where you stand. Ask it in your next quarterly review: how long would it take us to integrate a new AI capability across our production systems? Not a proof of concept. Not a sandbox. A coordinated change touching identity, data flow, access controls, and monitoring — deployed, secured, and operational.
AI is the forcing function that makes this question impossible to defer. The WEF Global Cybersecurity Outlook 2026 reports that 87% of organizations identified AI-related vulnerabilities as the fastest-growing cyber risk over 2025. The IBM X-Force 2026 Threat Intelligence Index observed a 44% increase in attacks exploiting public-facing applications, with AI-enabled vulnerability discovery accelerating the pace. The familiar tradeoffs are already on every CISO’s radar: don’t adopt AI and the business falls behind; adopt it without security rigor and you’re exposed.
But there’s a third failure mode that no compliance framework measures. If your organization lacks the velocity to respond to AI-identified vulnerabilities — the ones arriving faster and in greater volume than any thirty-day patch cycle was designed for — you’re exposed whether or not you adopted AI. The attackers did. That velocity gap is the one nobody’s tracking, and it’s the one that will determine which organizations survive the next forcing function, whether it’s AI-driven or not.
If the answer to the question is “we’d need to assess the risk first,” listen to what that sentence is actually saying. The risk of moving is the only risk the system can articulate. The risk of standing still has no field in the form.
The breakout looks the same in security as it did in SaaS. Pick a smaller line of business. Choose a non-critical system. The stakes are low, which means the risk is low, which means you can start at lower fidelity and iterate rapidly. The first version won’t have full capability. It doesn’t need to. What it needs is the cycle — build, test, learn, improve — running fast enough that quality arrives through iteration instead of through planning. By the time the broader organization is ready to evaluate the new pattern, it’s no longer a proposal with known gaps. It’s a working system with proven results. The legacy teams won’t adopt the new architecture because you argued them into it. They’ll adopt it when the evidence from a working implementation makes the old approach indefensible.
Every organization sets its own failure constraint. For cybersecurity, it may be zero breaches. For platform reliability, four nines. For product delivery, an acceptable regression rate. The constraint is yours to define.
The argument is about mechanism. Whatever your constraint, the system designed to meet it by preventing change will produce the opposite of what it intends. The path to meeting that constraint runs through motion, not stillness.
The compliance dashboard will be green the morning of the incident. It always is. What’s missing is the number that would have told you the incident was coming — the field in the form, the line on the dashboard, the figure in the quarterly review that makes the cost of not moving as visible as the cost of moving.
The organizations that can move have the fewest failures. Quality and throughput, measured together, are how you build that capacity.
References
Nicole Forsgren, Jez Humble, and Gene Kim, Accelerate: The Science of Lean Software and DevOps (IT Revolution Press, 2018). The four DORA metrics and the empirical finding that speed and stability are correlated, not opposed, based on research from 2014–2017 across thousands of organizations.
DORA team, State of DevOps Reports, 2014–present. The annual research program that produced the data underlying the Accelerate findings and continues to track software delivery performance globally.
CVE-2021-44228 (Log4Shell), publicly disclosed December 9, 2021. CVSS 10.0. Apache Log4j remote code execution vulnerability affecting hundreds of millions of systems. CISA Director Jen Easterly called it “one of the most serious [vulnerabilities] I’ve seen in my entire career, if not the most serious.”
NIST SP 800-53 Rev. 5, Security and Privacy Controls for Information Systems and Organizations. 1,196 controls across 20 families. The primary control baseline for U.S. federal information systems and the most widely adopted security control catalog in the industry.
World Economic Forum, Global Cybersecurity Outlook 2026. 87% of respondents identified AI-related vulnerabilities as the fastest-growing cyber risk over 2025.
IBM X-Force, 2026 Threat Intelligence Index. 44% increase in attacks exploiting public-facing applications, with AI-enabled vulnerability discovery cited as an accelerating factor.
W. Edwards Deming, Out of the Crisis (MIT Press, 1986). The foundational argument that quality comes from improving the process, not inspecting the output.