Security Brutalism Under Real Conditions, Part 4: Building the Program
Part 3 covered how to build a working inventory and a consequence map. With those in hand, the work becomes concrete. Everything from this point gets prioritized against the top of that list.
Most security programs add before they subtract. A new threat surfaces, a vendor shows up with a solution, a compliance framework adds a requirement, and the stack grows. Each addition seems justified in isolation. The cumulative effect is an environment that is harder to understand, harder to operate, and harder to defend than it needs to be.
Hardening starts with removal.
Take the security tool stack and apply a single test to each item: does it reduce the susceptibility of the systems at the top of the consequence map, limit the blast radius if those systems are compromised, or reduce the time it takes to detect and recover from compromise of those systems? If the answer is no, or if no one can articulate a specific answer with evidence, that tool is a candidate for removal.
This is not a comfortable exercise. Tools were purchased for reasons, often good ones at the time. Some exist because a compliance framework requires them. Some because a previous incident created pressure to buy something. Some because a vendor relationship made it easy. The question is whether the tool demonstrably improves survivability for what matters now, not whether there was once a reason. Every tool that cannot answer that question is adding complexity, consuming maintenance time, running with elevated privileges, and representing a potential supply chain exposure. Security tooling has been a primary attack vector in high-profile compromises. The burden of proof for keeping something in the environment should be high.
Access follows the same logic. List every standing permission, long-lived credential, and access grant to consequential systems. Every item on that list needs a current documented business need. Anything that cannot produce one gets revoked. Access accumulates silently through normal operations: team changes, projects that end but leave their service accounts behind, integrations provisioned in one sprint and never reviewed in the next. The quarterly entitlement review is the only mechanism that pushes back against this consistently. It is not a compliance exercise. It is the practice of asking, regularly, whether the access that exists today is the access that should exist today.
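The quarterly review can be partly mechanized. The sketch below, in Python, flags grants that lack a documented business need or have not been reviewed within the quarter; the record shape and the 90-day window are illustrative assumptions, not a specific product's schema.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Hypothetical record shape; field names are illustrative assumptions.
@dataclass
class AccessGrant:
    identity: str
    system: str
    business_need: Optional[str]   # documented justification, if any
    last_reviewed: Optional[date]

def revocation_candidates(grants, today, max_age_days=90):
    """Return grants with no documented need or no review this quarter."""
    stale_before = today - timedelta(days=max_age_days)
    return [g for g in grants
            if not g.business_need
            or g.last_reviewed is None
            or g.last_reviewed < stale_before]

grants = [
    AccessGrant("svc-etl", "billing-db", "nightly export, ticket OPS-112", date(2025, 9, 1)),
    AccessGrant("svc-legacy", "billing-db", None, None),  # no documented need
]
print([g.identity for g in revocation_candidates(grants, today=date(2025, 10, 1))])
# → ['svc-legacy']
```

The point of the automation is not to decide; it is to produce a short list a human can actually review against current business need.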
After the subtractive work comes structural decisions. These follow directly from the survivability questions.
No standing access to consequential systems. Access is granted for specific tasks, scoped to minimum necessary permissions, with a defined expiration and a full audit trail. The operational requirement behind this is that compromising a credential today should not automatically yield persistent access that was granted months ago and never reviewed. Just-in-time access workflows are the implementation. The security reason is not compliance with a framework. It is that persistent access is the mechanism by which stolen credentials convert into sustained intrusion.
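A minimal sketch of the just-in-time pattern, assuming an in-memory grant store (the class and field names are invented for illustration, not any vendor's API): every grant carries an expiration, and every grant and check lands in an audit trail.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Illustrative just-in-time grant store; names are assumptions.
@dataclass
class JITGrant:
    identity: str
    system: str
    scope: str
    expires_at: datetime

class JITAccess:
    def __init__(self):
        self.audit_log = []   # every grant and every check is recorded
        self._grants = []

    def grant(self, identity, system, scope, ttl_minutes, now):
        g = JITGrant(identity, system, scope, now + timedelta(minutes=ttl_minutes))
        self._grants.append(g)
        self.audit_log.append(("grant", identity, system, scope, now))
        return g

    def is_allowed(self, identity, system, scope, now):
        ok = any(g.identity == identity and g.system == system
                 and g.scope == scope and now < g.expires_at
                 for g in self._grants)
        self.audit_log.append(("check", identity, system, scope, now, ok))
        return ok

jit = JITAccess()
t0 = datetime(2025, 1, 1, 9, 0)
jit.grant("alice", "billing-db", "read", ttl_minutes=60, now=t0)
print(jit.is_allowed("alice", "billing-db", "read", now=t0 + timedelta(minutes=30)))  # True
print(jit.is_allowed("alice", "billing-db", "read", now=t0 + timedelta(hours=2)))     # False: expired
```

A credential stolen two hours after the grant yields nothing, which is exactly the property standing access lacks.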
Separation of duties for high-consequence actions. Any single action that can cause irreversible damage should require a second human or a mandatory review step before execution. Deleting production data, modifying access controls, deploying to production in ways that cannot be rolled back, moving significant funds. The implementation belongs in deployment and change management workflows. This slows attacker progression and creates a detection window. When an attacker compromises a credential and attempts an irreversible action, the review requirement is the gap where detection can operate.
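The two-person rule can be enforced in code at the workflow layer. A hedged sketch, with invented class names: the requester cannot approve their own action, and execution is blocked until a distinct second human signs off.

```python
# Minimal two-person rule: an irreversible action executes only after a
# second, distinct human approves. Names are illustrative.
class PendingAction:
    def __init__(self, name, requested_by):
        self.name = name
        self.requested_by = requested_by
        self.approved_by = None

    def approve(self, approver):
        if approver == self.requested_by:
            raise PermissionError("requester cannot approve their own action")
        self.approved_by = approver

    def execute(self):
        if self.approved_by is None:
            raise PermissionError("second-person approval required")
        return f"executed {self.name}"

action = PendingAction("delete-production-data", requested_by="alice")
try:
    action.execute()              # blocked: no approval yet
except PermissionError as e:
    print(e)
action.approve("bob")             # distinct second human
print(action.execute())
```

The window between request and approval is the detection window the paragraph above describes.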
Blast radius by design. Every consequential system gets segmented so that full compromise of a neighboring system does not automatically yield access to it. The architecture question is: if everything with a trust relationship to this system is fully owned by an attacker, what can they do? If the answer is "everything the system can do", the segmentation needs to change. This is not about preventing compromise. It is about bounding what happens when compromise occurs.
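The architecture question above can be asked mechanically: model trust relationships as a directed graph and compute reachability. The graph below is invented example data; an edge A → B means "compromise of A yields access to B".

```python
from collections import deque

# Hedged sketch: trust relationships as a directed graph. Edge data invented.
trust = {
    "workstation":    ["internal-api"],
    "internal-api":   ["billing-db", "cache"],
    "cache":          [],
    "billing-db":     [],
    "build-server":   ["artifact-store"],
    "artifact-store": [],
}

def blast_radius(start, graph):
    """Everything an attacker reaches from a fully compromised start node."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {start}

print(sorted(blast_radius("workstation", trust)))
# → ['billing-db', 'cache', 'internal-api']
# If a consequential system shows up here, segmentation needs to change.
```

Re-running this after every architectural change keeps the answer to "what can they do?" current rather than remembered.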
Friction on high-consequence paths. Re-authentication before high-consequence actions, step-up verification before irreversible operations. This adds friction for legitimate users as well, and that tradeoff is correct. The amount of friction applied should be proportional to the consequence of the action being protected. Applied thoughtfully, it does not overwhelm normal work and still protects what matters.
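Proportional friction can be expressed as a simple policy check. In this sketch the consequence score, threshold, and re-authentication window are all invented parameters for illustration:

```python
from datetime import datetime, timedelta

# Sketch of step-up verification: actions above a consequence threshold
# require a recent re-authentication. All thresholds are assumptions.
REAUTH_WINDOW = timedelta(minutes=5)

def authorize(action_consequence, last_auth_at, now, threshold=7):
    """Low-consequence actions pass; high-consequence ones need fresh auth."""
    if action_consequence < threshold:
        return True
    return now - last_auth_at <= REAUTH_WINDOW

now = datetime(2025, 1, 1, 12, 0)
print(authorize(3, last_auth_at=now - timedelta(hours=1), now=now))     # True: routine action
print(authorize(9, last_auth_at=now - timedelta(hours=1), now=now))     # False: step-up required
print(authorize(9, last_auth_at=now - timedelta(minutes=2), now=now))   # True: recently re-authenticated
```

Routine work proceeds untouched; only the irreversible paths pay the re-authentication cost.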
Seeing is Detecting
The detection standard is not whether a logging platform is deployed or whether alert volumes look active. It is whether you know when your consequential systems are being attacked before the attacker reaches the objective.
Many programs fail this test in practice. They have high alert volumes that nobody reads. They have logs that exist for auditors and are never queried for real adversary behavior. Teams normalize ignoring alerts because the signal-to-noise ratio is too low to act on all of them. That normalization is the most dangerous condition in the detection stack. When teams stop reading alerts, real signals disappear into the background, indistinguishable from noise.
Detection architecture should be built against the attack paths identified in the inventory phase. If you know the realistic paths an attacker would take toward each high-consequence system, you can instrument the chokepoints on those paths rather than trying to achieve uniform coverage across everything.
Behavioral baselines on consequential systems are the foundation. Normal access to a system follows specific patterns: particular identities, at specific times, performing specific operations. Deviation from established baseline on high-consequence systems should produce an immediate alert, not a batch summary at the end of the week. The specific anomalies worth alerting on immediately: first-time access from any identity to a consequential system, access at unusual hours for that identity, unusual data volume, changes to access controls or audit logging configuration, new process execution on critical infrastructure.
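A baseline check of this kind reduces to comparing each access event against what is known-normal for that system. The sketch below covers three of the anomalies listed above; the baseline structure and event fields are illustrative assumptions.

```python
# Sketch of baseline-deviation alerting for a consequential system.
# Baseline contents and event fields are invented for illustration.
baseline = {
    "billing-db": {
        "known_identities": {"svc-etl", "alice"},
        "normal_hours": range(8, 19),   # 08:00-18:59 local
    }
}

def anomalies(event):
    """Return the reasons this access event deviates from baseline."""
    b = baseline[event["system"]]
    reasons = []
    if event["identity"] not in b["known_identities"]:
        reasons.append("first-time identity")
    if event["hour"] not in b["normal_hours"]:
        reasons.append("unusual hour")
    if event.get("changed_audit_config"):
        reasons.append("audit config change")
    return reasons

event = {"system": "billing-db", "identity": "svc-legacy",
         "hour": 3, "changed_audit_config": True}
print(anomalies(event))
# → ['first-time identity', 'unusual hour', 'audit config change']
```

Any non-empty result on a consequential system should page immediately, not land in a weekly batch summary.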
Detection of lateral movement requires instrumenting the paths between systems, not only the endpoints. If an attacker pivots from a compromised workstation through an internal service to a database, you want to detect the movement at each stage rather than only detecting arrival at the final target.
Deception assets are among the highest signal-to-noise detection investments available per unit of engineering effort. Honeytokens, canary credentials, honeydocuments placed in locations that legitimate users would never access: old backup directories, decommissioned service account configurations, internal documentation no longer referenced by active systems, historical configuration files. Any activation of these is high-confidence evidence of active exploration. Near-zero maintenance. Near-zero false positives. When one fires, it is a priority investigation regardless of other workload.
Deploy them at multiple layers: a canary cloud credential in a build artifact that only someone with source code access would find, a fake API key in an internal wiki page, a canary database credential in a configuration file that is no longer in use. An alert from any of these means someone is actively looking.
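The detection side of a canary credential is deliberately trivial, which is the point. A sketch, with invented placeholder values: the credentials below exist only to be found, so any authentication attempt with one is high-confidence evidence of active exploration.

```python
# Sketch of canary-credential detection. Values are invented placeholders.
CANARY_CREDENTIALS = {
    "AKIA-CANARY-BUILD-01": "canary cloud key in build artifact",
    "db-canary-legacy":     "canary database credential in unused config",
}

def check_auth_attempt(credential_id):
    """Return a critical alert if the credential is a canary, else None."""
    if credential_id in CANARY_CREDENTIALS:
        return {"severity": "critical",
                "reason": CANARY_CREDENTIALS[credential_id],
                "note": "priority investigation regardless of other workload"}
    return None

print(check_auth_attempt("AKIA-CANARY-BUILD-01"))
print(check_auth_attempt("real-service-key"))  # None: not a canary
```

There is no statistical model to tune and no baseline to maintain, which is why the false positive rate stays near zero.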
Alert volume discipline is part of detection engineering, not an operational afterthought. If the team has normalized skipping alerts because volume is too high or false positives are too frequent, prune alert rules until that stops. This often means accepting that some low-confidence signals are dropped, in exchange for every high-confidence signal receiving immediate attention. A small set of reliable alerts that always gets investigated beats a large set of noisy ones that nobody reads.
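Pruning can be driven by per-rule signal quality rather than intuition. A sketch, where the 10% true-positive threshold is an illustrative assumption rather than a standard:

```python
# Sketch: measure per-rule signal quality and flag rules for pruning.
def prune_candidates(alert_stats, min_true_positive_rate=0.10):
    """alert_stats maps rule name -> (alerts fired, alerts that were real)."""
    return [rule for rule, (fired, real) in alert_stats.items()
            if fired and real / fired < min_true_positive_rate]

stats = {
    "canary-credential-used": (3, 3),       # every firing was real
    "any-failed-login":       (5000, 12),   # noise nobody reads
}
print(prune_candidates(stats))
# → ['any-failed-login']
```

Tracking this ratio per rule also produces the alert signal quality metric discussed later in this piece.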
The Gap Between Assumed and Actual Recovery
This is where most programs are most fictional.
Backup systems exist but have never been restored under realistic conditions. Recovery procedures are documented but have never run under pressure. Kill switch processes exist on paper but have never been timed. These gaps are invisible until an incident reveals them, at which point the cost of discovering them is orders of magnitude higher than the cost of finding them in a test.
The survivability test for each consequential system covers four measurements. All four require evidence, not estimates.
Assume compromise right now. What can a realistic attacker do with current access? What data can they reach? What actions can they take? What else can they pivot to? This defines the current worst-case blast radius. If this has not been answered from the inventory and hardening work, start here.
Time to detect. Based on actual monitoring in the current environment, how long before a human is looking at the right data and understands what is happening? Not theoretical alert times. The realistic path from event occurrence to human understanding, including escalation time, queue depth, and the accuracy of the first investigation response.
Time to contain. How long to revoke all access, isolate the system, and stop further damage? This requires tested procedures for each consequential system. The test is not reading the documentation. It is revoking access and measuring how long the process takes end-to-end. If revoking a compromised service account requires coordinating multiple teams and manual steps taking hours, that duration is the actual blast radius window.
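Measuring end-to-end means timing the real procedure, not estimating it. A minimal drill harness, where the step functions are stand-ins for the actual revocation and isolation procedures:

```python
import time

# Sketch of a timed containment drill. The lambdas stand in for real
# procedures (revoking keys, isolating hosts, rotating credentials).
def timed_drill(steps):
    """Run steps in order; return (total seconds, per-step seconds)."""
    per_step = {}
    start = time.monotonic()
    for name, fn in steps:
        t0 = time.monotonic()
        fn()
        per_step[name] = time.monotonic() - t0
    return time.monotonic() - start, per_step

steps = [
    ("revoke service-account keys", lambda: time.sleep(0.01)),
    ("isolate host from network",   lambda: time.sleep(0.02)),
    ("rotate database credentials", lambda: time.sleep(0.01)),
]
total, breakdown = timed_drill(steps)
print(f"containment window: {total:.2f}s")  # this duration is the blast radius window
```

The per-step breakdown shows where the hours actually go, which is usually a cross-team handoff rather than a technical operation.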
Time to restore. Can you restore from backup to a verified working state, and how long does it take? Not "verify the backup exists". Actually run the restoration process in a test environment, end-to-end, under realistic conditions, and measure it. Run this quarterly for consequential systems, rotating through so each one gets tested at least once a year. What breaks in a scheduled test can be fixed before it matters. What breaks during an actual incident under time pressure is a different kind of problem.
The gaps that surface in recovery exercises are almost never technical. They are unclear ownership when multiple teams are involved, slow escalation because nobody has clear authority to make decisions under pressure, and communication failures because the people who know the system and the people authorized to act are not the same people. Those are the gaps that determine actual recovery time, and they only surface when you run realistic exercises built around the attack paths from your inventory, not generic tabletop scenarios.
Chaos engineering extends this to the infrastructure level. Introduce controlled failures in non-production environments and measure actual detection and response: how long to an alert, how long to human acknowledgment, how long to accurate understanding, how long to containment. Every assumption about detection and response times should be validated by evidence. The assumption that "the SIEM will catch that" or "we can revoke credentials in five minutes" should be demonstrated, not taken on faith.
Cadence and Entropy
None of this has a finish line. Security degrades the moment a system goes live. Permissions accumulate as teams change and integrations are added. Firewall exceptions get approved and never revisited. Service accounts outlive their purpose. Credentials go unreviewed. Alert coverage drifts as systems evolve. This is not a failure of vigilance. It is the natural behavior of any production environment under continuous change pressure.
The operational response is a predictable cadence that pushes back against entropy before it compounds.
Quarterly: entitlement review of all access to consequential systems. Every access grant reviewed against current business need. Every integration reviewed against current use. Anything that cannot be justified is revoked. Also quarterly: restoration test for at least one consequential system. Full restoration to a test environment, timed, with gaps documented.
Annually: red team exercise scoped to the actual consequence map. Not CVE hunting or perimeter penetration testing. Attack path simulation against the systems that would actually damage the business if compromised. The exercise should use realistic paths from the inventory and measure actual detection and response capability under sustained adversarial pressure.
Continuously: every proposed new tool, integration, or access grant evaluated against the three survivability questions before approval. Anything that cannot justify its contribution does not get added.
Monthly: pick one consequential system and walk through its current access list, current network paths, and current detection coverage. Ask what has changed since the last review. Something always has. The discipline is catching drift while it is still small.
Metrics
The metrics most commonly reported describe the existence of controls, not their effectiveness. Tool coverage percentages, vulnerability counts, and compliance scores describe inputs. They do not answer whether the program is working.
Time to detect, measured from an event occurring to a human understanding what is happening, tells you whether detection is functioning. Time to contain and time to restore, measured from actual incidents and test exercises rather than runbook estimates, tell you whether recovery capability is real. Blast radius per consequential system, tested rather than assumed, tells you the actual scope of a compromise. Alert signal quality, meaning the proportion of alerts that represent real activity worth investigating, tells you whether detection is producing signal or noise.
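Computing these from incident and exercise records is straightforward once the records exist. A sketch with invented data, timestamps expressed as minutes from event start:

```python
from statistics import median

# Sketch: outcome metrics from incident/exercise records. Records invented;
# each value is minutes from event occurrence to the named milestone.
incidents = [
    {"detected": 45, "contained": 90,  "restored": 240},
    {"detected": 30, "contained": 120, "restored": 300},
    {"detected": 60, "contained": 75,  "restored": 180},
]

def metric(records, key):
    """Median minutes from event occurrence to the named milestone."""
    return median(r[key] for r in records)

for key in ("detected", "contained", "restored"):
    print(f"median time to {key}: {metric(incidents, key)} min")
# median time to detected: 45, contained: 90, restored: 240
```

The hard part is not the arithmetic; it is the discipline of recording real timestamps during incidents and drills so the inputs exist at all.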
These metrics require operational discipline to produce. They are also the only ones that answer the question the program is trying to answer: how long do we stay failed, and is that number getting smaller?
If you’re interested in building a stronger security program along these lines, you can reach out at Black Arrows.