Jacob Kaplan-Moss

DORA Metrics: the Right Answer to measuring engineering team performance

“What metrics should I use to measure my engineering team’s performance?”

I get asked this question often enough that it’s worth an FAQ.

You might expect my answer to be “it depends” but no: I believe there’s a single right answer! More precisely, there’s an established set of metrics that are so good, so widely applicable, that they should be your starting point. They may not work, or you may want to augment them with other metrics, but they’ll work for the vast majority of teams. Unless and until you have evidence that they’re not working for your team, you should use these metrics.

These metrics are:

  1. Deployment Frequency (DF): how often does your team successfully release to production? Usually measured in releases per time period (day/week/month). High-performing teams generally ship more often, and changes in DF tell you if your team is slowing down or speeding up.

  2. Lead Time For Changes (LT): measures the time interval between when a change is started (e.g. a ticket moved from your backlog to started, remediation starting on a vulnerability, etc.) and when it gets released. This measures your team’s efficiency and responsiveness. A high LT is often an indicator that your team is overloaded and doing too many things at once, or that you’re not breaking work up into small enough chunks.

  3. Change Failure Rate (CFR): what percentage of your changes cause a failure (an outage, a bug, a security issue, etc.)? CFR is a measure of quality control. A high CFR can indicate poor testing practices, lack of rigor in code review, systemic security issues, and so forth.

  4. Mean Time To Recovery (MTTR): when there is a failure, how long does it take to recover and return the system to normal operation? Teams with a low MTTR can be more bold: when failures happen, they can recover quickly. Emphasizing a low MTTR tends to push teams towards building more robust systems.

These are known as the “DORA Metrics”. DORA stands for DevOps Research & Assessment, a research group at Google. Their research found that the metrics above correlate with successful, high-performing teams. For more background, and a deep dive into that research, you should read Accelerate: The Science of Lean Software and DevOps.
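
To make these definitions concrete, here’s a minimal sketch of how the four metrics could be computed once you have your own deploy, change, and failure records. The record shapes and numbers below are invented for illustration; they aren’t a standard schema, and your own pipeline or tracker is the real source of truth.

    # Minimal sketch: computing the four DORA metrics from simple records.
    # The record shapes and values here are invented for illustration.
    from datetime import datetime, timedelta

    deploys = [
        # (deployed_at, caused_failure)
        (datetime(2024, 3, 4, 10, 0), False),
        (datetime(2024, 3, 5, 15, 30), True),
        (datetime(2024, 3, 7, 9, 15), False),
    ]
    changes = [
        # (started_at, released_at) -- however your team defines "started"
        (datetime(2024, 3, 1, 9, 0), datetime(2024, 3, 4, 10, 0)),
        (datetime(2024, 3, 3, 13, 0), datetime(2024, 3, 5, 15, 30)),
    ]
    failures = [
        # (detected_at, recovered_at)
        (datetime(2024, 3, 5, 16, 0), datetime(2024, 3, 5, 18, 45)),
    ]
    weeks_observed = 1  # length of the measurement window

    # Deployment Frequency: deploys per week over the window
    df = len(deploys) / weeks_observed

    # Lead Time for Changes: mean time from "started" to "released"
    lt = sum((done - start for start, done in changes), timedelta()) / len(changes)

    # Change Failure Rate: percentage of deploys that caused a failure
    cfr = 100 * sum(1 for _, failed in deploys if failed) / len(deploys)

    # Mean Time To Recovery: mean time from detection to recovery
    mttr = sum((end - begin for begin, end in failures), timedelta()) / len(failures)

    print(f"DF: {df}/week, LT: {lt}, CFR: {cfr:.0f}%, MTTR: {mttr}")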

Tips & Tricks for DORA metrics

Adopting these metrics in practice does take some skill; there are subtleties. I won’t go super in-depth here – for that, see Accelerate – but I do want to share a couple of quick tips and tricks I’ve picked up using these metrics over the years:

  1. Measure first; set targets later (or not at all). Remember Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure”. My experience with the DORA metrics is that prematurely setting targets doesn’t work; you need to measure and monitor these for a while – like, multiple quarters – before deciding what’s “good”.

    Here’s an example: say your engineering team deploys 3 times per week. Is that a “good” or a “bad” DF? Trick question: it’s impossible to know without more context. For some teams, three times per week is great; for others, it’s staggeringly slow.

    In many cases, setting targets at all is unnecessary. I find monitoring change over time to be more valuable than setting any particular target. Relative change tells you that something about the team has changed (positively or negatively), and you can investigate the root causes and adjust accordingly.

    Either way, if you do want to set explicit targets, measure first; don’t guess.

  2. Carefully define ancillary terms. There are several terms that the DORA metrics use that you need to carefully define: “failure”, “recovery”, “change”, what “started” and “finished” means, etc. Even “deployed” can be more complex than it first appears: what if you deploy new features behind feature flags? Do you stop the clock on LT when the code hits production, or when 100% of users are flagged in? Somewhere in between?

    To some degree, the answers don’t really matter as long as you’re consistent. But it’s worth some time up front to carefully define exactly how you’ll be measuring these values.
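
One lightweight way to act on that tip is to write the definitions down somewhere versioned, so nobody has to guess what the dashboards are actually measuring. Here’s a hypothetical sketch of what that record might look like; the specific choices just echo the examples in this post and aren’t a recommendation:

    # Hypothetical example of recording a team's DORA definitions explicitly.
    # The specific choices below are illustrative, not prescriptive.
    DORA_DEFINITIONS = {
        "deploy": "any successful deploy of main to the production cluster, "
                  "regardless of feature flags",
        "change_started": "ticket moved to the 'In Progress' column",
        "change_released": "code deployed to production (not 100% flag rollout)",
        "failure": [
            "user-facing bug introduced by a deploy",
            "security issue introduced by a deploy",
            "deploy that causes an incident, per the incident policy",
        ],
        "recovery": "incident declared over, or fix deployed to production",
    }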

A few real examples of applying DORA metrics

Finally, to make this concrete, I want to share a few examples of how to apply these metrics to different kinds of teams. These are all real examples:

Feature Team

The team: a feature team responsible for implementing product features, end-to-end.

How they defined and used DORA metrics:

  • DF: deploys per day, measured directly from their CD pipeline; each time main got deployed to the production cluster counted as a deploy (regardless of any feature flagging). The target range was 5–20 deploys per week: if any week’s count fell outside this range, they’d discuss it as a team to see if something might be up.

  • LT: measured from their ticket tracker (GitHub Issues); the clock starts when an issue is moved into the “In Progress” column and ends when the issue is closed. (Reopening a ticket starts the clock again, adding more time until it’s closed for good; a sketch of this calculation appears after this list.) The goal was a mean LT of less than one week, designed to keep tickets very small.

  • CFR: percentage of deploys (as measured above) that had a failure. A “failure” was defined as any of the following:

    • a user-facing bug that was introduced in that deploy
    • a security issue that started in that deploy
    • the deploy caused an incident (as defined by their incident policy/guide)

    They had no specific target for CFR, but monitored it over time for upward drift.

  • MTTR: for any “failure” as defined above, the time between when the failure was first noted and when it was resolved (either the incident declared over, or the bug fixed and deployed to production). Because bugs and incidents have very different timelines, the two were measured separately (which they called “BugMTTR” and “IncidentMTTR”). Goals: mean IncidentMTTR of under 4 hours; mean BugMTTR according to severity, ranging from 24 hours for critical issues to 90 days for low-sev.
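
Here’s a rough sketch of the LT calculation described above, under one reading of the reopening rule: the clock starts when the ticket moves to “In Progress” and keeps running until the final close. The event format is a simplified stand-in for whatever your tracker exports, not the actual GitHub API shape.

    # Rough sketch: lead time for one ticket from a simplified event list.
    # Assumes the clock runs from the move to "In Progress" until the *last*
    # close, so a reopen keeps adding time until the ticket is closed again.
    from datetime import datetime

    def lead_time(events):
        """events: chronological list of (timestamp, event_name) tuples."""
        started = None
        last_closed = None
        for ts, name in events:
            if name == "moved_to_in_progress" and started is None:
                started = ts
            elif name == "closed":
                last_closed = ts
        if started is None or last_closed is None:
            return None  # never started or still open; exclude from the mean
        return last_closed - started

    example = [
        (datetime(2024, 3, 1, 9, 0), "moved_to_in_progress"),
        (datetime(2024, 3, 3, 17, 0), "closed"),
        (datetime(2024, 3, 4, 10, 0), "reopened"),
        (datetime(2024, 3, 5, 12, 0), "closed"),
    ]
    print(lead_time(example))  # 4 days, 3:00:00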

Security Engineering Team

The team: a security engineering team with an “embedded” engagement model, meaning members of the team embed with different engineering teams, usually for several months, to work on major initiatives with high risk or some sort of security focus.

How they defined and used DORA metrics:

  • DF: measured indirectly by looking at the DF of the teams they were embedded within. The goal was for DF to be unchanged by the presence of SecEng; in other words, SecEng shouldn’t slow down the host team’s deploy frequency. (A sketch of this comparison appears after this list.)

  • LT: measured pretty similarly to LT in the previous example, just for changes that SecEng worked on directly (as an engineer, reviewer, etc). As with DF, the target was “don’t negatively affect LT of the team you’ve joined”.

  • CFR: defined specifically as security issues introduced by a change made when SecEng was involved. Target was a CFR < 1% (which I argued was way too high, but that’s a story for another time).

  • MTTR: also defined specifically for security issues in products SecEng had worked on. Measured as the time elapsed between initial discovery and final remediation. Goals were defined by the company’s vulnerability management policy, ranging from 7 days for Critical to 90 days for Low.
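
For the DF target above, “measured indirectly” mostly comes down to comparing the host team’s deploy counts before and during the embed. A toy sketch of that comparison (the weekly counts are made up):

    # Toy sketch: did the host team's deployment frequency change during the
    # SecEng embed? The weekly deploy counts are made up for illustration.
    from statistics import mean

    weekly_deploys_before = [12, 9, 14, 11, 10, 13]   # weeks before the embed
    weekly_deploys_during = [11, 12, 10, 13, 12, 11]  # weeks during the embed

    baseline = mean(weekly_deploys_before)
    embedded = mean(weekly_deploys_during)
    change = (embedded - baseline) / baseline

    print(f"DF before: {baseline:.1f}/week, during: {embedded:.1f}/week ({change:+.0%})")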

SRE Team

The team: an SRE team that acted as a service team for about a dozen other engineering teams. SRE wasn’t responsible for centralized operations (each team operated and was on-call for its own components), but was responsible for “leveling up” the organization overall.

How they defined and used DORA metrics:

  • DF: SRE monitored DF for each component team; each team measured DF similarly to the first example above. Each quarter, they’d work with the team with the lowest DF to try to increase it.
  • LT: not used by this SRE team.
  • CFR: SRE monitored CFR for each component team, defined as “percentage of deploys causing an incident”. Once again, they’d work with the lowest-performing team(s) to reduce their CFR. Interestingly, they’d sometimes find correlation between the CFRs of certain pairs or groups of teams, which would indicate some sort of failure dependency (a sketch of this check appears after this list). Often that turned into work helping those teams decouple their failure states from each other.
  • MTTR: SRE found that MTTR was highly correlated across pretty much all component teams, so they ended up measuring MTTR holistically for the entire org. Goal was to keep MTTR below 4 hours (this translated to a specific SLA they offered; 4 hours didn’t come out of nowhere).
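
The cross-team CFR correlation mentioned above is easy to check once each team’s CFR is tracked as a time series. Here’s a rough sketch with invented quarterly values and hypothetical team names (statistics.correlation needs Python 3.10+):

    # Rough sketch: flag pairs of teams whose change failure rates move together,
    # which can hint at a shared failure dependency. All values are invented.
    # statistics.correlation requires Python 3.10+.
    from itertools import combinations
    from statistics import correlation

    cfr_by_team = {  # quarterly CFR, as a percentage
        "payments": [4.0, 6.5, 3.0, 7.0, 5.5],
        "checkout": [3.5, 6.0, 2.5, 7.5, 5.0],
        "search":   [5.0, 2.0, 6.0, 3.0, 4.5],
    }

    for (team_a, a), (team_b, b) in combinations(cfr_by_team.items(), 2):
        r = correlation(a, b)
        flag = "  <-- worth a look" if r > 0.8 else ""
        print(f"{team_a} vs {team_b}: r = {r:+.2f}{flag}")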

Have you used DORA metrics? How’d it go?

Hope this helps! If you’ve used DORA metrics yourself, I’d love to know how it went. Get in touch: @jacobian on Twitter, or jacob @ this domain.