Sidebar #4: Quantitative Risk Revisited
This is the sixth part of my series on thinking about risk, and the last “sidebar” to Part 1, my basic introduction to risk. It’ll make the most sense if you’ve read that piece and understand the terms risk, likelihood, and impact that I discussed there.
In Part 1, I briefly covered quantitative risk measurement – assigning a numeric value to risk, like “$3,500”, rather than a qualitative label like “medium”.
To refresh, the basic idea is to estimate the probability of an event, estimate the cost, and then do the math, e.g.:
- We have a 50% chance, in any given year, of a bad deploy causing an outage of an hour or less.
- An outage of an hour costs us $5,000.
- Thus, the Risk of this scenario is $2,500/yr (50% × $5,000).
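If it helps to see the arithmetic spelled out, here’s a minimal sketch of that calculation in Python – the function name and parameters are just illustrative, not any standard API:

```python
# A minimal sketch of the calculation above; the names are illustrative.

def annualized_risk(annual_probability: float, cost_per_event: float) -> float:
    """Expected annual loss: the chance of the event in a year times its cost."""
    return annual_probability * cost_per_event

# 50% chance per year of a bad deploy causing an hour-long outage,
# at $5,000 per outage:
print(f"${annualized_risk(0.5, 5_000):,.0f}/yr")  # -> $2,500/yr
```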
But then I pretty quickly suggested not trying this technique, writing:
However, in practice this kind of formal analysis is rare, at least in the contexts I’m familiar with (security, engineering, wilderness travel). I’m presenting it first to show that it’s totally possible, that this framework has robust analytical underpinnings, but also to recommend that you don’t try it, at least at first.
In this final sidebar, I want to come back to quantitative risk measurement. I’ll spend a bit more time explaining what I see as its pros and cons – why you might or might not want to use numeric values over simpler risk matrixes.
What problems arise with qualitative risk measurement?
Using qualitative labels for risk is easy and convenient, but they’re sort of definitionally sloppy and imprecise. Previous sidebars have touched on a couple of the problems with imprecise labels:
- It’s hard to “do the math” on multiple risks – there’s no way to say “3 Low risks plus 4 Medium risks equals…”. This is a serious problem for risk analysis because real-world accidents/incidents are almost always caused by multiple contributing risks. (There’s a sketch of this problem just after this list.)
- Using a risk matrix collapses important context about how you reached that label. This is most easily spotted by noting the two different kinds of Medium risk, but it’s a problem with other categories, too. Knowing that something’s low risk because there’s negligible impact vs. low risk because the scenario’s so unlikely is important, and that context gets lost when collapsing labels.
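To make that first problem concrete, here’s a minimal sketch in Python – the scenarios and numbers are entirely made up – showing that quantitative estimates combine with simple addition, while labels don’t combine at all:

```python
# Made-up scenarios, for illustration only.
scenarios = [
    {"name": "bad deploy outage",   "risk_usd_per_yr": 2_500, "label": "Medium"},
    {"name": "expired TLS cert",    "risk_usd_per_yr":   800, "label": "Low"},
    {"name": "region failover bug", "risk_usd_per_yr": 1_200, "label": "Low"},
]

# With quantitative estimates, combining risks is just addition:
total = sum(s["risk_usd_per_yr"] for s in scenarios)
print(f"Total expected loss: ${total:,}/yr")  # -> $4,500/yr

# With labels, there's no meaningful arithmetic:
# "Medium" + "Low" + "Low" = ...?
```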
More broadly, while simple labels feel intuitive, they can pose a substantial challenge to risk communication and management. To explain what I mean, let’s look at a couple of facts about accidents in two different wilderness contexts:
- Avalanche forecasters use a five-level danger scale. You might think that most accidents happen when the danger is rated at Level 4 (“High”) or 5 (“Extreme”), but in fact more fatalities happen at Level 3 (“Considerable”) than at any other level – and 20% of fatalities happen at a Level 1 (“Low”) danger rating!
- The whitewater community uses a six-level difficulty classification scale. Again, you might assume that most accidents happen in riskier water, but, once again, more fatalities happen in Class III (“Intermediate”) water – and 16% of fatalities happen in Class I-II water.
There’s a lot going on here – risk homeostasis, the Dunning-Kruger effect, and more – but poor risk communication caused by imprecise labels is a major factor. “Intermediate” or “Moderate”/“Considerable” don’t sound all that scary – especially on scales that include “Extreme” and “Expert” – so it’s easy to see why someone with beginner-level skill might think they’re ready to push into more intermediate challenges. In the whitewater world, many drownings in Class I (“Easy”) water have alcohol and/or the lack of a life jacket as contributing factors, and once again it’s easy to see how someone might conclude that “Easy” water means it’s cool to drink or not wear a life jacket.
This dynamic plays out in security/engineering contexts: management/leadership hears “low risk” and decides to de-prioritize mitigation (or accept the risk). Frustrated, the security team discovers that if they call something “High” or “Critical”, it does get acted upon, so they begin inflating risk assessments to make sure something happens. But now there’s even less time for mitigation work, so low-severity issues get de-prioritized even further. Quickly this devolves into a dysfunctional situation with only two categories: “too low priority to fix” or “pants-on-fire emergency”. This dynamic is depressingly common – I think I’ve seen it at at least a dozen organizations.
As with avalanches and whitewater, there’s more going on here than just communication – but still, communication is a contributing factor in creating this dysfunctional dynamic.
Why might you use quantitative risk analysis?
So: there are a number of problems with these simple labels and risk matrixes. That might be reason enough to drive you towards trying out quantitative measurement. But even if you’re not experiencing any of those problems, there are still a few good reasons why you might be interested in quantifying risk:
It can (kind of, sometimes) be a compliance requirement. Many compliance regimes require that you maintain a risk register – a document tracking all of the organization’s risks and actions against those risks. You can absolutely use qualitative labels in a risk register – I’m not aware of any compliance regime that requires quantitative measurement in these documents – but in practice auditors and compliance experts tend to push for quantitative techniques in risk registers. This makes sense: if you’re going to invest the time in maintaining a quality risk register, at that point there’s not much additional effort involved in estimating likelihood and impact numerically.
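For illustration, here’s a sketch of what a quantitative risk register entry might look like in code – the fields and names are hypothetical, not mandated by any compliance regime:

```python
from dataclasses import dataclass

@dataclass
class RiskRegisterEntry:
    # Hypothetical fields; real registers vary by organization and auditor.
    scenario: str              # what might happen
    annual_probability: float  # estimated chance per year, 0.0 to 1.0
    impact_usd: float          # estimated cost if it happens
    mitigation: str            # planned action against the risk
    owner: str                 # who's responsible for that action

    @property
    def annualized_risk_usd(self) -> float:
        return self.annual_probability * self.impact_usd

entry = RiskRegisterEntry(
    scenario="bad deploy causes an hour-long outage",
    annual_probability=0.5,
    impact_usd=5_000,
    mitigation="add canary deploys",
    owner="platform team",
)
print(f"${entry.annualized_risk_usd:,.0f}/yr")  # -> $2,500/yr
```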
You can clearly calculate return on investment (ROI). When considering a mitigation, it’s common to ask, “is this worth the effort?” That’s a hard question to answer with risk matrixes: if it costs $1,000 to reduce a risk from High to Medium, is that “worth it”? It’s much easier to answer with hard numbers: if it costs $1,000 to reduce an annualized risk from $3,000 to $1,000, you’ve spent $1,000 to eliminate $2,000 of risk, for a net gain of $1,000. Being able to do these sorts of calculations can make decisions easier.
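Spelled out (again, a sketch using the made-up numbers from above):

```python
# ROI of a mitigation, using the illustrative numbers above.
risk_before_usd = 3_000       # annualized risk before the mitigation
risk_after_usd = 1_000        # annualized risk after the mitigation
mitigation_cost_usd = 1_000   # what the mitigation costs

risk_reduction = risk_before_usd - risk_after_usd    # $2,000
net_benefit = risk_reduction - mitigation_cost_usd   # $1,000
roi = net_benefit / mitigation_cost_usd              # 1.0, i.e. 100%

print(f"Net benefit: ${net_benefit:,}, ROI: {roi:.0%}")
```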
It’s easier to back-test estimates and assumptions. If you’re writing down specific estimates about probability and consequences, you can now get feedback on how good your estimates were. E.g., if you estimated that a day of downtime would cost the company $100,000, and then you had a day of downtime and it actually cost $200,000, you now know you were off by 2x and can presumably adjust further estimates accordingly. You can’t really do this with “impact: medium”.
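A crude sketch of what that feedback loop might look like (the numbers come from the example above; the adjustment rule is just illustrative):

```python
# Back-testing an impact estimate; numbers from the example above.
estimated_cost_usd = 100_000  # predicted cost of a day of downtime
actual_cost_usd = 200_000     # what the downtime actually cost

error_ratio = actual_cost_usd / estimated_cost_usd
print(f"Estimate was off by {error_ratio:.1f}x")  # -> off by 2.0x

# A crude correction for the next estimate of the same kind:
next_raw_estimate_usd = 150_000
print(f"Adjusted: ${next_raw_estimate_usd * error_ratio:,.0f}")  # -> $300,000

# There's no equivalent feedback loop for "impact: medium".
```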
You’ll have an easier time putting security work up against other disciplines. Security leaders often really struggle at the executive level because they lack precision. Sales executives bring sales forecasts: “if we build a State/Local Government salesforce, it’ll cost $X and we’ll make $Y”. Engineering executives can forecast costs: “switching from AWS to GCP will cost $X and we’ll save $Y”. Against this precision, security executives without quantitative forecasts sound like children: “this is really bad, we should do something about it!”
So why do I recommend against it, then?
Right: there are problems with risk labelling, and there are upsides to quantitative measurement – so why do I recommend against trying it (at least at first)?
It’s a more advanced technique
Mostly: because it takes discipline and sustained long-term effort to use quantitative measurement effectively. It’s not particularly useful for helping folks new to risk management make decisions in a timely manner. Your first risk estimates will take a long time, and will be inaccurate enough that your guess at a probability number won’t be much better than just saying “medium”. One of the core precepts of quantitative risk analysis is that estimation is a skill – you can get better at it over time. That’s absolutely true, but it leaves out the other side of the coin: like any skill, when you first try it, you’ll probably suck!
Multiple analysis paralysis traps
I’ve also found, through hard-won experience, that most people fall into one or more “traps” – ways to waste tons of time – and thus get stuck when trying to deploy quantitative risk measurement techniques. The most common traps I’ve seen are:
- Trying to achieve completeness – a sense that they need to articulate literally every single potential risk scenario, thus spending way too much time just brainstorming scenarios.
- Falling into an estimation rabbit hole trying to achieve an implausibly high standard of precision on the probability/impact of each possible scenario.
- Bikeshedding from leadership, where discussions of risk turn into arguments over the precision of the probability/cost estimates for a particular scenario (or all of them).
- The “I don’t know” trap, where a lack of confidence in estimating risk scenarios leads to paralysis.
All of these can cause a risk estimation project to spin out of control and end up taking days, weeks, even months of time with little to show for it.
I don’t want to place a value on human life
Finally, I’ve also found that quantifying certain kinds of risk leads to a style of reasoning that feels “icky” or unethical. Sometimes you’re faced with a scenario where the impact is “someone might die”, or similar. This is more common in the wilderness scenarios I’ve been writing about, but it happens in software too – imagine doing risk analysis on an automated insulin pump, for example. To me, it feels fundamentally inappropriate to assign a dollar value to human life.
I’m fully aware that reasonable people disagree with me here, and that there are standardized ways of assigning quantitative value to life. I just … they don’t feel okay to me. The fact that these techniques mostly come from the health insurance industry maybe hints at why I find them morally bankrupt. Assigning numeric values in these situations serves to distance and hide information: “$300,000 of risk” hits pretty differently from “10% chance you might die”.
This isn’t necessarily an argument against using quantitative risk analysis techniques generally, but it is an argument against using them as the only technique. It can’t be the only tool in your pocket.
Overall, quantitative risk analysis is both better and worse
To sum up: quantitative risk analysis is “better” in the sense that it’s more precise, gives you more tools and options, and can take you more places. But it has a much steeper learning curve and some dangerous traps, so in the hands of a beginner it’s probably worse.
I think I’ll leave it there for now. If you’d like to dig deeper into quantitative risk measurement, these should be the next things you read:
- Ryan McGeehan (aka @magoo): Simple Risk Measurement
- Douglas W. Hubbard & Richard Seiersen: How to Measure Anything in Cybersecurity Risk
I do have some thoughts on using these techniques effectively and avoiding some of the “traps” I mentioned above; I may share that in a future piece. Contact me if you’d like to read that, or if you’ve got questions, or other topics you’d like to see covered in this series.
Next up: with these “sidebars” complete, it’s back to the main series. I’ll return to risk mitigation, and present my framework for choosing mitigation efforts. Stay tuned!