THE SECURITY BRUTALIST

Survivability Engineering, Part 1: Risk Assessment

Too much of cyber risk quantification today chases the illusion of precision. The spreadsheets promise clarity, with rounded loss numbers and simulated curves, but, in my opinion, they rarely survive contact with reality. The problem is not measuring risk; it is how we have been trying to measure it. Models built on weak data and subjective assumptions, or built just to please the Board, create a tidy sense of control that costs millions to maintain and seldom influences real decisions.

It is time to pivot from abstract probability math to concrete, evidence‑based survival scenarios. Replace "the annualized loss is $3.2M" with "this system could fail like this, for these reasons, and it would keep us down for this long". That reframing turns risk from a financial fiction into an engineering discipline I'm calling Survivability Engineering, where the goal is simple: maximize real‑world resilience for every dollar spent.

Assessing Risk

When you're assessing risk, forget the spreadsheets for a minute. You're not trying to calculate a number, you're trying to understand how this thing fails under pressure.

Start with something concrete. A system. A product. A workflow. Not "enterprise risk". One real thing.

First question: why does this exist?

You want to understand what business function it supports and who actually depends on it. If it goes down, what stops? Who screams first? Revenue? Operations? Legal? If no one can clearly explain why it matters, that's already useful information. Either it's low impact, or it's poorly understood. Both are risk signals.

Once you understand why it matters, shift to how it could be attacked. Ask how an attacker would realistically touch it. Not theoretical nation-state zero-days. Real entry paths. You want to know how it's exposed, who can log into it and talk to it, and what other systems connect to it. Is it reachable from the internet, from user devices, or only internally? Could a compromised laptop reach it? Could a stolen SaaS token access it? You want to build an attack graph. Remember, attackers think in graphs. You're mentally tracing the shortest believable path from "attacker exists" to "attacker has control".
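If you want to make that tracing concrete, a minimal sketch is below. The system names, the edges, and the target are invented for illustration; the point is that a rough reachability map plus a shortest-path walk is the whole exercise, whether you do it in code or on a whiteboard.

    from collections import deque

    # Hypothetical reachability edges ("X can reach Y"); names are invented for illustration.
    edges = {
        "internet": ["vpn-gateway", "saas-crm"],
        "vpn-gateway": ["employee-laptop"],
        "employee-laptop": ["jump-host", "saas-crm"],   # a compromised laptop
        "saas-crm": ["billing-api"],                    # a stolen SaaS token
        "jump-host": ["billing-api", "billing-db"],
        "billing-api": ["billing-db"],
        "billing-db": [],
    }

    def shortest_path(graph, start, target):
        # Breadth-first search: the shortest believable path from
        # "attacker exists" to "attacker has control".
        queue, seen = deque([[start]]), {start}
        while queue:
            path = queue.popleft()
            if path[-1] == target:
                return path
            for nxt in graph.get(path[-1], []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(path + [nxt])
        return None

    print(shortest_path(edges, "internet", "billing-db"))
    # ['internet', 'saas-crm', 'billing-api', 'billing-db']

The output you want is that one path. It is the path you will assume has already been walked in the next step.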

Now here's the key move: assume compromise.

Don't ask how likely it is. Assume someone gets in and has real access. What can they actually do? Can they read sensitive data? Change transactions? Create admin accounts? Push code? Shut it down? Pivot to something more critical?

Then translate that into business language. If that happens, what breaks and for how long? Are we down for hours, days, weeks? Are customers locked out? Are orders halted? Are we in breach of contract? Would regulators care? Is this material? Would it require filing an 8-K?

Avoid vague words like "high impact". Force yourself to say something concrete, even if it's uncomfortable: "we would not be able to process transactions for three days". That's risk clarity, and I think that's what senior leadership and the Board actually want to know.

After that, look at survivability.

If this system is hit tomorrow, how fast would we notice? Not theoretically, but based on how we actually monitor today. How fast could we contain it? Could we isolate it, disable accounts, cut connections? And how quickly could we restore it cleanly?

This last point is where most theater dies. Backups might exist, but have we restored from them recently? Logging might be enabled, but does anyone actively review or alert on it in a meaningful way? Access controls might be defined, but are they tight or are they "temporary" exceptions that became permanent? You're not checking whether controls exist, but whether they hold under stress.

At this point, you should be able to say, plainly:

"If this system is compromised in a realistic way, the worst credible damage is X. We would likely be disrupted for Y time. Our biggest weakness is Z".

That sentence is your risk assessment.

Only then do you decide what to do.

Ask yourself: what single change most reduces the damage or the downtime? What actually limits blast radius or shortens recovery? Not what improves our maturity score. Not what aligns nicely to a framework.

Often it's something unglamorous, like tightening privileged access, segmenting one critical dependency, testing restores, or clarifying incident ownership. The right move is usually the one that materially improves containment or recovery speed.

Make that change. Then reassess.

If you later need to bring money into the conversation, you can. Take the downtime you identified and multiply it by a rough cost per day. That gives you a dollar view. Compare that to the cost of the improvement. Now you have a financial discussion.
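A rough sketch of that arithmetic, with invented numbers:

    # Back-of-the-envelope dollar view; every figure here is a made-up example.
    downtime_days = 3            # "we would not be able to process transactions for three days"
    cost_per_day = 250_000       # rough cost of one day of that outage
    improvement_cost = 80_000    # e.g. segmenting the dependency and testing restores

    exposure = downtime_days * cost_per_day
    print(f"Rough exposure from one credible incident: ${exposure:,}")    # $750,000
    print(f"Cost of the improvement:                   ${improvement_cost:,}")   # $80,000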

But don't start there. If you don't understand how you fail and how long you stay failed, adding dollar figures just gives false precision.

Risk assessment, done well, is simple.

  1. Understand how this thing gets compromised.
  2. Understand what really breaks when it does.
  3. Understand how long you stay broken.
  4. Improve that.

Then repeat.
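If it helps to keep the answers honest from one pass to the next, the whole assessment fits in a record this small. A sketch, with invented values:

    from dataclasses import dataclass

    # The assessment forced into concrete fields; all values are illustrative.
    @dataclass
    class SurvivabilityAssessment:
        system: str                 # one real thing, not "enterprise risk"
        failure_path: str           # shortest believable path to compromise
        worst_credible_damage: str  # X, stated in business terms
        expected_downtime: str      # Y
        biggest_weakness: str       # Z
        next_change: str            # the single change that most reduces damage or downtime

    billing = SurvivabilityAssessment(
        system="billing platform",
        failure_path="stolen SaaS token -> billing API -> billing database",
        worst_credible_damage="cannot process transactions",
        expected_downtime="three days",
        biggest_weakness="restores from backup have never been tested",
        next_change="test restores quarterly and segment the billing database",
    )

If a field stays vague after a pass, that vagueness is itself the finding.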



You can download the simplified step list (PDF).