IT Environments – Top 5 Deployment Metrics

February, 2019

By Christian Meléndez.

Preamble: Following on from our previous blog on Enterprise Release Management bench-marking, Christian talks about the “sister disciple” of deployment management and more specifically the top 5 metrics one could use to better understand and optimize their operations and streamline application lead times and project delivery.  

Introduction It’s time to talk about the top five deployment metrics that you could use to make wise decisions when improving the lead time of an application. Mapping your value stream is an excellent way to identify where you’re having problems shipping application changes on time. Deployment metrics will tell you when and how much time you need to spend improving the deployment pipeline’s quality and make it more efficient. When talking about deployments, the key metrics are related to frequency and how much time deployments are taking. But quality is also an important metric. How many deployments fail? Ideally, none. How much time does it take to recover when there’s a failure? How many deployments are on schedule? There are going to be a lot of reasons why these metrics could be wrong, like spending too much time doing manual tasks. But without all this information it’s hard to know where to focus your efforts.

The Number of Deployments

Just because the current trend is to make several deployments a day doesn’t mean that you must focus on deploying changes even if you don’t need to. The number of deployments you execute says a lot about how frequently you’re changing the application. The more frequently you deploy, the smaller the changes and the risk they carry are. It’s really different to do one big deployment to change the database engine as opposed to changing the database engine little by little—for example by using feature flags. The best approach to how frequently you should deploy resonates with the saying if it hurts, do it more often. Martin Fowler wrote about this saying, explaining that frequency reduces the difficulty of activities. And we all know that deployments can be both difficult and risky. Most of the time when there’s a downtime it’s because a deployment just occurred. Fearing such downtimes, we decide to accumulate changes or set tight schedules to deploy. Increasing the number of deployments you do helps you to improve deployments until they become dull and deterministic.

The Time for Each Deployment

If you want to deploy frequently, deployments shouldn’t take too much time to finish. The amount of time a deployment takes is a sign of how complex or simple the steps to deploy a change for the application are. Maybe you need to update a lot of servers, and you’re updating them one by one. Or maybe there are many dependencies that you need to take care of to finish the deployment. For example, you need to update the database or update the HTTP endpoint for the services that depend on the application you deploy. Time is also a sign of how many manual processes still exist in your deployment pipeline. The more manual labor exist, the riskier a deployment becomes. And that risk only increases when there’s pressure to ship an application change. Everything that can go wrong will go wrong. This is especially true when someone is following a manual or trusting their ability to remember every step. Just one mistake in a deployment, and chances are high that you’ll need to start over again. So is the answer to improving deployment time getting rid of manual labor with automation? Well, sometimes the effort is worth it. Other times, the problem might be in a tight architecture where every fix could generate a new issue. An application should be able to be deployed independently without having to coordinate too many things. A deployment that takes a few minutes or less will also allow you to increase the frequency of deployments.

The Number of Failed Deployments

Now let’s say that you’ve been able to (or at least are able to) deploy more frequently and deployments are fast. How many of these deployments do you fail? Hopefully not many. But if you know that the failure rate has increased, you can decide to slow down a little bit and start focusing all your energy on improving the deployment’s quality. Correlate the number of errors in the application with deployments. The next time a downtime occurs or the availability of the application is not good, you could start by asking if this starting to happen because a deployment just happened. When the deployment failure rate increases, maybe you need to increase the code coverage, add more integration tests, add performance tests, or implement canary releases. But what I’ve seen happening most of the time is that the quality of deployments decreases because deployments are treated differently in production than other environments. Having differences in the way you deploy your application to all different environments will cause you problems. Deploy the same way everywhere, like using the same scripts, recipes, or playbooks. Having a low failure rate will increase the confidence in the team and will let you know that you’re on the right path to improve the lead time of your deployments.

The Mean Time to Recover After a Downtime

Every time a deployment happens, there’s always the possibility of causing downtime in the application. Reasons behind a downtime may vary, like missing test cases or a type in the number of servers that have to be updated with the new code. Having 100 percent of availability for a service is expensive, and it isn’t worth it. Instead of trying to enforce how to avoid any downtime, try to focus on recovering as fast as you can when something goes wrong. Start by setting the baseline to define how the system looks when it’s healthy. Then take action every time something gets outside the parameters of that “normal” state. For example, you might define a service level objective (SLO) to have a latency under 500ms. If, after a deployment, the latency of the system turns out to be above the SLO, you might want to roll back immediately. Ideally, if you’ve been able to improve the deployment frequency and it doesn’t take too much time for the deployment to finish, the mean time to recover should be small. I mean, it’s just about doing another deployment, ordinary business by this point. It could be about changing the artifact version to a previous one. Remember, the time it takes to deploy a change will indicate how hard or easy it is to make a change in the application. The best way to keep the mean time to recover small is by practicing the rollback mechanism. When everything is under version control, it gets easier—it’s just about going back to the previous version and deploying again.

The Number of Deployments on Schedule

It doesn’t matter if you’re able to ship application changes rapidly if they’re not on time, live in production. The number of on-schedule deployments is a good metric to motivate the team to improve the deployment pipeline. Also, anyone that’s interested in knowing more about the reasons why the need to introduce automation or infrastructure as code will find this metric helpful. I’ve worked on projects where we needed to change many things in every deployment. A few of the changes were done manually. There were even times where unexpected problems arose in a testing environment. These problems had a significant impact on the testing phase of the application. We lacked heterogeneal environments. It was possible that something could happen when pushing the application changes to the next environment. Having deployments that are being done on schedule is a sign that your deployment pipeline is in good condition. You frequently deploy, it takes just minutes, and more importantly, you have a steady quality.

How Are Your Deployment Metrics Doing?

Before you start using these metrics to identify waste and efficiently deployments, you need to know your current state. What are the current numbers for these type of metrics in your workloads? If you don’t know what the current state is, you’ll end up improving processes that don’t add value. Using these deployment metrics will help you identify where there’s waste in your value stream mapping. You’ll have a clear view of where your efforts should be oriented to improve speed and quality when shipping new features.
Author Christian Meléndez. Christian is a technologist that started as a software developer and has more recently become a cloud architect focused on implementing continuous delivery pipelines with applications in several flavors, including .NET, Node.js, and Java, often using Docker containers.  

Relevant Articles

Self-Healing Applications

02NOVEMBER, 2022 by Sylvia Froncza Original March 11 2019An IT and Test Environment Perspective Traditionally, test environments have been difficult to manage. For one, data exists in unpredictable or unknown states. Additionally, various applications and services...

Data Operations: Defined and Explained

01NOVEMBER, 2022 by Justin Reynolds.Businesses across the board are spinning their tires when it comes to data and analytics, with many of them failing to unlock maximum value from their investments. According to one study, 89% of companies face challenges around how...

Software Security Anti-Patterns

02NOVEMBER, 2022 by Eric Boersma *Original 22 October 2019If you're like a lot of developers, you might not think much about software security. Sure, you hash your users' passwords before they're stored in your database. You don't return sensitive information in error...

What makes a good Test Environment Manager?

14 OCTOBER 2022 by Daniel de OliveiraIn today’s application-based world, companies are releasing more applications than ever before. Software delivery life cycles are becoming more complicated. As a result, large companies require hundreds and even thousands of test...

Staging Server Success: The Essential Guide To Setup and Use

01NOVEMBER, 2022 by EricStaging Server Success: The Essential Guide To Setup and Use Release issues happen.  Maybe it's a new regression you didn't catch in QA. Sometimes it's a failed deploy. Or, it might even be an unexpected hardware conflict.  How do you catch...

What makes a good Test Data Manager?

19 NOVEMBER, 2020 by Michiel Mulders What Makes a Good Test Data Manager? Have you implemented test data management at your organization? It will surely benefit you if your organization processes critical or sensitive business data. The importance of test data is...