Applying Security Brutalism, A Playbook for Leadership and Practitioners

Part 2: For Security Architects and Engineers

Phase 0: Building the Consequence Map in Practice

The consequence map is the prioritization order for everything in Phases 1 through 4. Getting it right matters more than any other single step, and the way to do it is through structured conversations with system owners, not a form-filling exercise done independently.

For each system, the conversation covers: what is the purpose of this system, who and what depends on it, what data does it hold or process, what does realistic worst-case compromise look like, and what does full recovery look like with current capabilities. The output is a ranked list with systems ordered by consequence severity.
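As a rough illustration, the output of those conversations could be captured as structured records and sorted. This is a minimal sketch: the field names, the 1-to-5 severity scale, and the example systems are assumptions made for illustration, not part of the method itself.

```python
from dataclasses import dataclass

@dataclass
class SystemEntry:
    name: str
    dependents: list          # who and what depends on it
    data_held: str            # what data it holds or processes
    worst_case: str           # realistic worst-case compromise
    severity: int             # hypothetical scale: 1 (minor) to 5 (existential)
    recovery_hours: float     # full recovery estimate with current capabilities

def rank_by_consequence(entries):
    # Highest severity first; ties go to the system that takes longer to recover.
    return sorted(entries, key=lambda e: (-e.severity, -e.recovery_hours))

# Hypothetical example systems.
consequence_map = rank_by_consequence([
    SystemEntry("internal-wiki", ["engineering"], "documentation",
                "defacement, credential leakage", 2, 8),
    SystemEntry("payments-db", ["checkout", "finance"], "cardholder data",
                "bulk exfiltration", 5, 72),
    SystemEntry("identity-provider", ["every system"], "credentials",
                "tenant-wide takeover", 5, 120),
])

# The systems that get Phases 1 through 4 applied first.
top_of_list = [e.name for e in consequence_map[:2]]
```

The tie-break on recovery time is one possible design choice; the ranking that matters is whatever the owners agree reflects real business consequence.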

Phases 1 through 4 get applied depth-first to the top of this list. If capacity is limited, accurate and complete coverage of the top two or three systems is more valuable than partial coverage of everything: an honest, detailed map of your crown jewel systems beats a comprehensive but shallow one.

Phase 1: Know, Inventory in Practice

The inventory requirement covers four areas: identities, trust relationships, data flows, and external attack surface. The goal is accurate, queryable knowledge of what can reach what, not a configuration management database that drifts out of accuracy as soon as it is built.

For identity inventory, start with a query of your identity provider for all non-human identities: service accounts, API keys, OAuth tokens, CI/CD pipeline credentials, machine identities, and long-lived tokens. Most organizations have three to ten times more of these than they expect. For each identity, the questions are: what does it have access to, when was it last used, and does it have a current documented owner and business purpose?

Identities with no recent use and no documented owner get revoked first. An attacker who compromises one of these pays no operational cost because no legitimate process will notice the credential being used.
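The revoke-first rule can be sketched as a filter over an exported identity list. The record shape, the 90-day staleness threshold, and the example identities are illustrative assumptions; a real export from an identity provider will look different.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class Identity:
    name: str
    kind: str                       # e.g. "service_account", "api_key"
    last_used: Optional[datetime]   # None means no recorded use
    owner: Optional[str]            # None means no documented owner

def revoke_first(identities, now, stale_after_days=90):
    """Return identities with no recent use AND no documented owner."""
    cutoff = now - timedelta(days=stale_after_days)
    return [
        i for i in identities
        if i.owner is None and (i.last_used is None or i.last_used < cutoff)
    ]

# Hypothetical inventory, evaluated at a fixed point in time.
now = datetime(2024, 6, 1)
inventory = [
    Identity("svc-legacy-export", "service_account", datetime(2023, 1, 10), None),
    Identity("ci-deploy-token", "pipeline_credential", datetime(2024, 5, 30), "platform-team"),
    Identity("old-partner-key", "api_key", None, None),
    Identity("svc-reporting", "service_account", datetime(2024, 5, 28), None),
]
candidates = [i.name for i in revoke_first(inventory, now)]
```

Here `svc-reporting` is not a revoke-first candidate because it is still in use, but its missing owner is a finding in its own right.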

For trust relationship mapping, the goal is to trace what can actually reach your consequential systems rather than what architecture diagrams say should be able to reach them. These diverge constantly through firewall exceptions, new integrations, and organic growth. The practical method is to trace from each consequential system backward: what has a trust relationship or network path to this system, and what does that thing connect to? Repeat until you have a full dependency graph for each high-consequence system.
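The backward trace is a breadth-first walk over reversed edges. A minimal sketch, assuming observed trust relationships and network paths have already been collected as (source, destination) pairs; the system names are hypothetical.

```python
from collections import deque

def upstream_of(target, edges):
    """Map every system that can reach `target` to its hop distance.

    `edges` is an iterable of (source, destination) pairs meaning
    "source has a trust relationship or network path to destination".
    """
    reverse = {}
    for src, dst in edges:
        reverse.setdefault(dst, set()).add(src)

    hops = {}
    queue = deque([(target, 0)])
    while queue:
        node, depth = queue.popleft()
        for src in reverse.get(node, ()):
            if src not in hops and src != target:
                hops[src] = depth + 1
                queue.append((src, depth + 1))
    return hops

# Hypothetical observed paths, not what the architecture diagram claims.
edges = [
    ("workstation", "jump-host"),
    ("jump-host", "billing-db"),
    ("ci-runner", "billing-db"),
    ("wiki", "workstation"),
    ("billing-db", "reporting"),
]
dependency_graph = upstream_of("billing-db", edges)
```

The hop count is a convenience for prioritizing: anything one hop from a consequential system deserves the same scrutiny as the system itself.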

For external attack surface, the questions are: what is reachable from the internet, and for each externally reachable component, what can it access internally if compromised? Third-party integrations and SaaS tools with access to internal data or systems belong on this list. For each integration, document what it can reach if that vendor's systems are compromised.

The documentation format matters less than accuracy and accessibility. What you need is to be able to answer "if system X is compromised, what can the attacker reach?" in under ten minutes at any point in time.
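One way to make that question answerable on demand is to keep the observed connections as data and compute reachability when asked. A sketch under the same assumption as before (edges recorded as (source, destination) pairs), with hypothetical system names:

```python
def reachable_from(compromised, edges):
    """Return every system an attacker could reach from `compromised`."""
    forward = {}
    for src, dst in edges:
        forward.setdefault(src, set()).add(dst)

    reached, stack = set(), [compromised]
    while stack:
        node = stack.pop()
        for nxt in forward.get(node, ()):
            if nxt not in reached and nxt != compromised:
                reached.add(nxt)
                stack.append(nxt)
    return reached

# Hypothetical consequence map and observed connections.
CONSEQUENTIAL = {"billing-db", "identity-provider"}
edges = [
    ("public-api", "order-service"),
    ("order-service", "billing-db"),
    ("vendor-saas", "internal-wiki"),
    ("internal-wiki", "order-service"),
]

# If the vendor is compromised, which consequential systems are exposed?
exposed = reachable_from("vendor-saas", edges) & CONSEQUENTIAL
```

The graph is only as good as the data behind it, so the edges should come from observed traffic and actual grants rather than from design documents.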

Phase 2: Harden, Subtractive Work First

Hardening starts with removal, then moves to structural improvement. Adding controls before removing unnecessary complexity is a common mistake that makes the environment harder to defend.

For tooling, go through every security and infrastructure tool in your stack and apply the three questions: does it reduce susceptibility, does it reduce damage, and does it shorten recovery time for your high-consequence systems? If the answer to all three is no or uncertain, flag it for removal. The bar for keeping a tool in the environment is not that it was purchased or that someone uses it occasionally. The bar is that it demonstrably improves survivability for what matters. Security tooling has been a primary attack vector in supply chain compromises; every tool in the environment is a potential target.

For access, enumerate all standing permissions, long-lived credentials, and service account grants to consequential systems. Every access grant needs a current documented business need. Anything that cannot be justified gets revoked. This goes on the quarterly review calendar, not as a compliance exercise but because access accumulates silently through normal operations and becomes attack surface without anyone intending it.

For integrations, map every third-party connection and API integration to consequential systems. Remove what is not actively used. For each integration that remains, document what it can access if that integration is compromised, and make sure that answer is acceptable.

The structural hardening principles follow from the three questions.

No standing access to consequential systems means access is granted for specific tasks, scoped to minimum necessary permissions, with a defined expiration. Just-in-time access workflows with full audit trails are the implementation. The operational requirement is that a credential compromised today does not yield long-term persistent access simply because it was granted standing permissions months ago and never reviewed.
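The shape of a just-in-time grant can be sketched as a record that carries its own scope, justification, and expiry. The names, the 60-minute default, and the ticket reference below are hypothetical; a real workflow would also write each issue and check to an audit log.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass(frozen=True)
class AccessGrant:
    identity: str
    system: str
    permissions: frozenset   # minimum necessary for the task
    reason: str              # task-specific justification
    expires_at: datetime

def issue_grant(identity, system, permissions, reason, now, ttl_minutes=60):
    if not reason.strip():
        raise ValueError("a task-specific justification is required")
    return AccessGrant(identity, system, frozenset(permissions), reason,
                       now + timedelta(minutes=ttl_minutes))

def is_allowed(grant, identity, system, permission, now):
    # Every check is evaluated against the clock, so access lapses on its own.
    return (grant.identity == identity
            and grant.system == system
            and permission in grant.permissions
            and now < grant.expires_at)

now = datetime(2024, 6, 1, 9, 0)
grant = issue_grant("alice", "payments-db", {"read"},
                    "INC-1234 investigate failed settlements", now, ttl_minutes=30)
```

Because expiry is part of the grant, a stolen credential is only useful inside the window it was issued for.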

Separation of duties for high-consequence actions means any single action that could cause irreversible damage requires a second human or a mandatory review step. This covers deleting production data, modifying access controls, deploying to production, and moving significant funds. The implementation belongs in your deployment and change management workflows. The security reason to do this is not compliance: it is that requiring two parties for a destructive action slows attacker progression and creates a detection window that would not otherwise exist.
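In a deployment or change management workflow, the rule reduces to a gate that refuses high-consequence actions without an independent approver. A minimal sketch; the action names and function are hypothetical and would map to whatever your pipeline already calls these operations.

```python
# Hypothetical identifiers for the actions named above.
HIGH_CONSEQUENCE_ACTIONS = {
    "delete_production_data",
    "modify_access_controls",
    "deploy_to_production",
    "move_funds",
}

def may_execute(action, requester, approvers):
    """Allow routine actions; require a second human for destructive ones."""
    if action not in HIGH_CONSEQUENCE_ACTIONS:
        return True
    # The requester cannot count as their own second party.
    independent = set(approvers) - {requester}
    return len(independent) >= 1
```

Excluding the requester from the approver set matters: an attacker holding one compromised account should not be able to satisfy both roles.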

Blast radius by design means segmenting every consequential system such that full compromise of a neighboring system does not automatically yield access to it. The architecture question to ask about every high-consequence system is: if everything with a trust relationship to this system is fully owned by an attacker, what can they do? If the answer is "everything the system can do," the segmentation needs to change.

Friction as protection is intentional and proportional. Re-authentication before high-consequence actions, step-up verification before irreversible operations, and access justification requirements slow attacker propagation and create detection opportunities. They add friction for legitimate users as well. That tradeoff is correct. The amount of friction applied should be proportional to the consequence of the action being protected, so it does not overwhelm normal workflows while still protecting what matters.

Minimal footprint is an ongoing practice, not a one-time hardening pass. Every running service, open port, installed package, and granted permission that is not actively needed is attack surface. Build reduction into normal operational cadence rather than treating it as a periodic cleanup project.

Phase 3: See, Detection Engineering

The detection standard is whether you know when your consequential systems are being attacked before the attacker reaches their objective. Not whether a SIEM is deployed, not whether EDR coverage metrics look good, but whether the detection capability produces actionable signal about real adversary behavior before damage occurs.

Build the detection architecture around your consequence map and the attack paths identified in Phase 1. If you know the realistic paths an attacker would take toward each high-consequence system, you can instrument the chokepoints on those paths rather than trying to achieve uniform coverage everywhere.

Behavioral baselines on consequential systems are the foundation. Normal access to a system follows specific patterns: particular identities, at specific times, performing specific operations on specific data. Deviation from established baseline on high-consequence systems should produce an immediate alert, not a batch summary at the end of the week.

The specific anomalies worth alerting on immediately:

First-time access from any identity to a consequential system
Access at unusual hours for that identity
Access from unusual locations or IP ranges for that identity
Unusual volume of data access or export
New process or service execution on critical infrastructure
Any changes to access controls or audit logging configuration on high-consequence systems
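The first four anomalies in the list can be checked against a per-identity baseline. This is a sketch under stated assumptions: the baseline structure, the event fields, and the thresholds are invented for illustration, and process execution or configuration changes would need different telemetry.

```python
import ipaddress
from dataclasses import dataclass

@dataclass
class Baseline:
    usual_hours: dict      # identity -> set of hours (0-23) normally seen
    usual_networks: dict   # identity -> list of CIDR strings normally seen
    max_rows: dict         # identity -> typical upper bound on rows per access

def anomalies(event, baseline):
    """Return the reasons an access event deviates from the baseline."""
    who = event["identity"]
    if who not in baseline.usual_hours:
        return ["first-time access"]

    found = []
    if event["hour"] not in baseline.usual_hours[who]:
        found.append("unusual hour")
    ip = ipaddress.ip_address(event["source_ip"])
    if not any(ip in ipaddress.ip_network(net) for net in baseline.usual_networks[who]):
        found.append("unusual network")
    if event["rows"] > baseline.max_rows[who]:
        found.append("unusual volume")
    return found

# Hypothetical baseline for one service identity on a consequential system.
baseline = Baseline(
    usual_hours={"svc-billing": set(range(8, 19))},
    usual_networks={"svc-billing": ["10.20.0.0/16"]},
    max_rows={"svc-billing": 5000},
)
suspicious = anomalies(
    {"identity": "svc-billing", "hour": 3, "source_ip": "203.0.113.7", "rows": 250000},
    baseline,
)
```

Any non-empty result on a high-consequence system would page someone immediately rather than land in a weekly summary.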

Detection of lateral movement requires instrumenting the paths between systems, not only the endpoints. If an attacker pivots from a compromised workstation to an internal service to a database, you want to detect the movement at each stage rather than only detecting arrival at the final target. This requires network visibility between segments and identity anomaly detection across system boundaries.

Honeytokens are among the highest-signal detection investments available per unit of engineering effort. They require almost no maintenance after deployment and produce near-zero false positives. The implementation involves placing canary credentials, API keys, and files in locations that legitimate users would never access: old backup directories, decommissioned service account configurations, internal documentation that is no longer referenced by active systems, historical configuration files. Any activation of these assets is high-confidence evidence of active exploration.

Deploy honeytokens at multiple layers of the environment: a canary cloud credential in a build artifact that only someone with source code access would find, a fake API key in an internal wiki page, a canary database credential in a configuration file that is no longer in use. When any of these activate, it is a priority investigation regardless of other workload.
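Detection for honeytokens can be as simple as matching authentication events against the set of planted identifiers. A minimal sketch, assuming events arrive as dictionaries with a `credential_id` field; the token identifiers and field names are hypothetical.

```python
# Hypothetical canary identifiers, mapped to where each one was planted.
CANARY_TOKENS = {
    "canary-cloud-key-01": "cloud credential planted in a build artifact",
    "canary-wiki-key-01": "fake API key on an internal wiki page",
    "canary-db-cred-01": "database credential in an unused config file",
}

def canary_hits(auth_events):
    """Return one high-priority finding per use of a planted credential."""
    hits = []
    for event in auth_events:
        placement = CANARY_TOKENS.get(event.get("credential_id"))
        if placement is not None:
            hits.append({
                "placement": placement,
                "source_ip": event.get("source_ip"),
                "priority": "investigate-now",
            })
    return hits

events = [
    {"credential_id": "svc-billing", "source_ip": "10.20.4.9"},
    {"credential_id": "canary-wiki-key-01", "source_ip": "10.31.7.22"},
]
findings = canary_hits(events)
```

Recording where each token was planted tells the responder which layer the intruder has already reached.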

Alert volume discipline is part of detection engineering, not an operational afterthought. An alert that no one reads is not detection; it is noise with a logging cost. If the team has normalized skipping or ignoring alerts because volume is too high or false positives are too common, prune alert rules until every alert gets investigated. This often means significantly reducing alert volume and accepting that some low-confidence signals are dropped, in exchange for every high-confidence signal receiving immediate attention. A small set of reliable, high-signal alerts beats a large set of noisy ones in every real incident.
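A starting point for pruning is to measure, per rule, how often its alerts were actually investigated. The sketch below assumes an exported alert history with `rule` and `investigated` fields and a 90 percent threshold; both are illustrative choices, not fixed values.

```python
from collections import defaultdict

def rules_to_prune(alert_history, min_investigated_rate=0.9):
    """Return rules whose alerts are routinely skipped, sorted by name."""
    totals = defaultdict(int)
    investigated = defaultdict(int)
    for alert in alert_history:
        totals[alert["rule"]] += 1
        if alert["investigated"]:
            investigated[alert["rule"]] += 1
    return sorted(
        rule for rule, count in totals.items()
        if investigated[rule] / count < min_investigated_rate
    )

# Hypothetical history: one noisy rule, one high-signal rule.
history = (
    [{"rule": "geo-anomaly", "investigated": False}] * 8
    + [{"rule": "geo-anomaly", "investigated": True}] * 2
    + [{"rule": "canary-hit", "investigated": True}] * 2
)
prune_candidates = rules_to_prune(history)
```

A flagged rule is a candidate for removal or tightening; the decision still belongs to the people who know why the rule was written.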

Behavior-based detection catches what signature detection misses. Signature libraries require knowing what an attack looks like in advance. Behavioral anomaly detection surfaces deviations from normal regardless of the specific technique used. The detection engineering investment should prioritize behavioral baselines and deviation alerting over expanding signature coverage.

A calibration exercise worth running: if all your security tooling went dark right now, what specific thing would tell you that you are under attack? If there is no concrete answer to that, detection capability is more fragile than it appears.

Phase 4: Recover, Building Real Recovery Capability

Recovery capability is where the gap between assumed posture and actual posture tends to be largest. Backups exist but have never been fully restored under realistic conditions. Recovery procedures are documented but have never been executed under pressure. These gaps are invisible until they are not.

The survivability test for each consequential system covers four measurements, and all four need evidence, not estimates.

Assume compromise right now: what can a realistic attacker do with current access, what data can they reach, what actions can they take, and what else can they pivot to? This defines the current worst-case blast radius.

Time to detect: based on actual monitoring in the current environment, how long before a human is looking at the right data and understands what is happening? Not theoretical alert times, but the realistic path from event occurrence to human understanding.

Time to contain: how long to revoke all access, isolate the system, and stop further damage? Are documented and tested procedures in place for each step?

Time to restore: can you restore from backup to a verified working state, and how long does it take with current capabilities?

These four numbers are the actual security posture for that system. If they have not been measured, the posture is unknown.
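Recording the four measurements per system makes the "unknown until measured" rule mechanical. A sketch with hypothetical field names, in which a value is only filled in once there is evidence behind it:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SurvivabilityRecord:
    system: str
    blast_radius: Optional[list] = None       # what a compromise can reach
    detect_minutes: Optional[float] = None    # event to human understanding
    contain_minutes: Optional[float] = None   # revoke, isolate, stop damage
    restore_minutes: Optional[float] = None   # backup to verified working state

    def unmeasured(self):
        names = ["blast_radius", "detect_minutes", "contain_minutes", "restore_minutes"]
        return [n for n in names if getattr(self, n) is None]

    def posture(self):
        # Estimates do not count: a missing measurement means unknown posture.
        return "unknown" if self.unmeasured() else "measured"

# Hypothetical record: detection was exercised, the rest never tested.
record = SurvivabilityRecord("payments-db", detect_minutes=75.0)
```

Listing the unmeasured fields gives the next quarter's test plan for that system.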

Kill switches are a concrete architectural requirement. For every consequential system, you need to be able to revoke all access within minutes, not hours. The implementation requires knowing every identity with access to the system (from Phase 1 work) and having tested revocation procedures for each of them. The test is not reading the documentation; it is revoking access and measuring how long the process takes end-to-end. If revoking a compromised service account requires coordinating multiple teams and manual steps taking hours, that duration is the real blast radius window.
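The revocation test can be wrapped in a small timing harness. This is a sketch only: `revoke` stands in for whatever call or procedure actually removes access in your environment, and `fake_revoke` below is a stub for illustration.

```python
import time

def timed_revocation(identities, revoke):
    """Revoke every identity, recording failures and total elapsed time."""
    started = time.perf_counter()
    failed = []
    for identity in identities:
        try:
            revoke(identity)
        except Exception as exc:
            # A failure is a finding: this identity survives the kill switch.
            failed.append((identity, str(exc)))
    elapsed = time.perf_counter() - started
    return {"elapsed_seconds": elapsed, "failed": failed}

def fake_revoke(identity):
    # Stub: simulates one identity that cannot be revoked automatically.
    if identity == "svc-legacy-export":
        raise RuntimeError("owner team must approve manually")

result = timed_revocation(["alice", "ci-deploy-token", "svc-legacy-export"], fake_revoke)
```

In a real test the elapsed time includes every manual step and handoff, and any identity in `failed` extends the blast radius window until someone finishes the job by hand.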

Tested restoration means actually running the restoration process in a test environment, measuring time end-to-end, and documenting what breaks or is missing. Run this on a quarterly schedule for consequential systems. The value is in finding the configuration dependencies, missing components, and procedure gaps ahead of time, rather than discovering them in the middle of an actual incident. What breaks in a scheduled quarterly test can be fixed before it matters; what breaks during a real incident under time pressure is a different kind of problem.

Incident response exercises should use the realistic attack paths identified in Phase 1, not generic scenarios. The goal is to find coordination failures, escalation delays, and ownership confusion before an attacker exploits them. Generic tabletop exercises where everyone knows the answer reveal little. Realistic exercises where the scenario is an actual attack path against a high-consequence system, run with the actual people who would respond, reveal operational reality. The gaps that consistently surface are not technical: they are unclear ownership, slow escalation, and communication failures under pressure.

Chaos engineering extends this to the infrastructure level. Introduce controlled failures in non-production environments and measure actual detection and response. Pick a realistic attack path, execute it in a controlled environment, and observe: how long to an alert, how long to human acknowledgment, how long to accurate understanding of what is happening, how long to containment. Every assumption about detection and response times should be validated by evidence before being relied upon. The assumption that "the SIEM will catch that" or "we can revoke credentials in five minutes" should be demonstrated, not taken on faith.
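The observations from such an exercise reduce to a handful of timestamps. A minimal sketch that turns them into minutes from attack start; the milestone names and the example times are hypothetical.

```python
from datetime import datetime

MILESTONES = ["alert_fired", "human_acknowledged", "understood", "contained"]

def minutes_from_start(timeline):
    """Minutes from attack start to each milestone; None if never reached."""
    start = timeline["attack_started"]
    return {
        m: ((timeline[m] - start).total_seconds() / 60 if m in timeline else None)
        for m in MILESTONES
    }

# Hypothetical exercise in which nobody recorded an accurate understanding.
exercise = {
    "attack_started": datetime(2024, 6, 1, 10, 0),
    "alert_fired": datetime(2024, 6, 1, 10, 42),
    "human_acknowledged": datetime(2024, 6, 1, 11, 15),
    "contained": datetime(2024, 6, 1, 13, 30),
}
measured = minutes_from_start(exercise)
```

A `None` for a milestone is itself a result: it marks an assumption that the exercise could not confirm.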

The Ongoing Work: Entropy Management

Security degrades the moment a system goes live. Access grants accumulate as teams change and integrations are added. Firewall exceptions get approved and never revisited. Service accounts outlive their purpose. Credentials go unrotated. Monitoring coverage drifts as systems evolve. These are not failures of individual vigilance; they are the natural behavior of any production environment under continuous change pressure. The operational discipline is to push back against this entropy on a predictable cadence.

Quarterly, conduct an entitlement review of all access to consequential systems. Every access grant is reviewed against current business need. Every integration is reviewed against current use. Anything that cannot be justified is revoked. This is a survivability exercise, not a compliance checkbox, which means it needs to be done honestly against the actual access list, not against what the access list is supposed to contain.

Quarterly, run a restoration test for at least one consequential system. Full restoration to a test environment, measured time-to-restore, documented gaps found. Rotate through systems so each one gets tested at least once a year.

Annually, run a red team exercise scoped to your actual consequence map. The exercise should simulate realistic attack paths against the systems that would actually damage the business if compromised. CVE hunting and generic penetration testing serve different purposes; this exercise is specifically about finding which attack paths succeed against current controls and measuring actual detection and response capability under sustained adversarial pressure.

Continuously, every proposed new tool, integration, or access grant gets evaluated against the three questions before approval. Anything that cannot justify its survivability contribution does not get added.

Continuously, maintain alert triage discipline. If the team is normalizing ignoring or deferring alerts, reduce alert volume until that stops. Noise hides signal. A high-confidence alert buried in a queue full of low-quality alerts is indistinguishable from noise until someone investigates it.

The practical monthly check: pick one consequential system and walk through its current access list, its current network paths, and its current detection coverage. Ask what has changed since you last looked. Something always has. The discipline is catching drift while it is still small.
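Asking what has changed is easiest when last month's state was saved. A sketch that diffs two snapshots of an access list, assuming each is stored as a set of (identity, permission) pairs; the same approach works for network paths or detection rules.

```python
def drift(previous, current):
    """Return what was added and removed between two snapshots."""
    return {
        "added": sorted(current - previous),
        "removed": sorted(previous - current),
    }

# Hypothetical access snapshots for one consequential system.
last_month = {("alice", "read"), ("svc-billing", "write")}
this_month = {
    ("alice", "read"),
    ("svc-billing", "write"),
    ("vendor-saas", "read"),
    ("bob", "admin"),
}
changes = drift(last_month, this_month)
```

Each added entry then gets the same question as in the quarterly review: who owns it and what current business need justifies it.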