IT Environments – Top 5 Deployment Metrics

February, 2019

By Christian Meléndez.

Preamble: Following on from our previous blog on Enterprise Release Management benchmarking, Christian talks about the “sister discipline” of deployment management, and more specifically the top 5 metrics you can use to better understand and optimize your operations and to streamline application lead times and project delivery.

Introduction

It’s time to talk about the top five deployment metrics that you could use to make wise decisions when improving the lead time of an application. Mapping your value stream is an excellent way to identify where you’re having problems shipping application changes on time. Deployment metrics will tell you when, and how much time, you need to spend improving the quality of the deployment pipeline and making it more efficient. When talking about deployments, the key metrics are related to frequency and to how much time deployments take. But quality is also an important metric. How many deployments fail? Ideally, none. How much time does it take to recover when there’s a failure? How many deployments are on schedule? There are many reasons why these metrics could look bad, like spending too much time doing manual tasks. But without all this information it’s hard to know where to focus your efforts.

The Number of Deployments

Just because the current trend is to make several deployments a day doesn’t mean that you must deploy changes when you don’t need to. The number of deployments you execute says a lot about how frequently you’re changing the application. The more frequently you deploy, the smaller the changes are and the less risk they carry. Doing one big deployment to change the database engine is very different from changing the database engine little by little, for example by using feature flags. The best approach to how frequently you should deploy resonates with the saying “if it hurts, do it more often.” Martin Fowler wrote about this saying, explaining that frequency reduces the difficulty of activities. And we all know that deployments can be both difficult and risky. Most of the time when there’s downtime, it’s because a deployment just occurred. Fearing such downtime, we decide to accumulate changes or set tight schedules to deploy. Increasing the number of deployments you do helps you improve them until they become dull and deterministic.
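As a rough illustration of how this metric could be tracked, here is a minimal sketch that counts deployments per ISO week. The dates are made-up sample data and the function name is hypothetical; in practice the dates would likely come from the history of whatever CI/CD tool you use.

```python
from collections import Counter
from datetime import date

# Hypothetical deployment dates (sample data, not real measurements).
deployment_dates = [
    date(2019, 2, 4),
    date(2019, 2, 5),
    date(2019, 2, 5),
    date(2019, 2, 13),
    date(2019, 2, 20),
]


def deployments_per_week(dates):
    """Count deployments per (ISO year, ISO week) pair."""
    return Counter(tuple(d.isocalendar()[:2]) for d in dates)


weekly_counts = deployments_per_week(deployment_dates)
for (year, week), count in sorted(weekly_counts.items()):
    print(f"{year}-W{week:02d}: {count} deployment(s)")
```

Watching how these weekly counts move over time is usually more telling than any single number.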

The Time for Each Deployment

If you want to deploy frequently, deployments shouldn’t take too much time to finish. The amount of time a deployment takes is a sign of how complex or simple the steps to deploy a change to the application are. Maybe you need to update a lot of servers, and you’re updating them one by one. Or maybe there are many dependencies that you need to take care of to finish the deployment. For example, you might need to update the database or update the HTTP endpoint for the services that depend on the application you deploy. Time is also a sign of how many manual processes still exist in your deployment pipeline. The more manual labor exists, the riskier a deployment becomes. And that risk only increases when there’s pressure to ship an application change. Everything that can go wrong will go wrong. This is especially true when someone is following a manual or trusting their ability to remember every step. Just one mistake in a deployment, and chances are high that you’ll need to start over again. So is the answer to improving deployment time getting rid of manual labor with automation? Well, sometimes the effort is worth it. Other times, the problem might be a tightly coupled architecture where every fix could generate a new issue. An application should be deployable independently, without having to coordinate too many things. A deployment that takes a few minutes or less will also allow you to increase the frequency of deployments.
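One simple way to keep an eye on deployment time is to record when each deployment starts and ends, then look at the typical and the worst case. The sketch below assumes you can export those timestamps; the sample runs are invented for illustration.

```python
from datetime import datetime
from statistics import median

# Hypothetical (start, end) timestamps for each deployment run.
runs = [
    (datetime(2019, 2, 4, 10, 0), datetime(2019, 2, 4, 10, 4)),
    (datetime(2019, 2, 5, 9, 30), datetime(2019, 2, 5, 9, 36)),
    (datetime(2019, 2, 5, 15, 0), datetime(2019, 2, 5, 15, 45)),
]

# Duration of each run in minutes.
durations_min = [(end - start).total_seconds() / 60 for start, end in runs]

median_minutes = median(durations_min)
slowest_minutes = max(durations_min)

print(f"Median deployment time: {median_minutes:.1f} min")
print(f"Slowest deployment: {slowest_minutes:.1f} min")
```

A large gap between the median and the slowest run is often a hint that a manual step or an unexpected dependency was involved in that deployment.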

The Number of Failed Deployments

Now let’s say that you’re able to deploy more frequently and deployments are fast. How many of these deployments fail? Hopefully not many. But if you know that the failure rate has increased, you can decide to slow down a little bit and focus your energy on improving the quality of deployments. Correlate the number of errors in the application with deployments. The next time downtime occurs or the availability of the application is not good, you could start by asking whether it’s happening because a deployment just occurred. When the deployment failure rate increases, maybe you need to increase the code coverage, add more integration tests, add performance tests, or implement canary releases. But what I’ve seen happening most of the time is that the quality of deployments decreases because deployments are treated differently in production than in other environments. Having differences in the way you deploy your application to different environments will cause you problems. Deploy the same way everywhere, using the same scripts, recipes, or playbooks. Having a low failure rate will increase the team’s confidence and will let you know that you’re on the right path to improving the lead time of your deployments.
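To make the failure rate concrete, here is a small sketch that computes it per environment, which can help reveal whether production is behaving differently from the others. The record layout (`env`, `succeeded`) and the sample data are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical deployment records (sample data).
deployments = [
    {"env": "staging", "succeeded": True},
    {"env": "staging", "succeeded": True},
    {"env": "staging", "succeeded": True},
    {"env": "staging", "succeeded": True},
    {"env": "production", "succeeded": True},
    {"env": "production", "succeeded": False},
    {"env": "production", "succeeded": True},
    {"env": "production", "succeeded": True},
]


def failure_rate_by_env(records):
    """Return {environment: failed deployments / total deployments}."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for record in records:
        totals[record["env"]] += 1
        if not record["succeeded"]:
            failures[record["env"]] += 1
    return {env: failures[env] / totals[env] for env in totals}


rates = failure_rate_by_env(deployments)
for env, rate in rates.items():
    print(f"{env}: {rate:.0%} of deployments failed")
```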

The Mean Time to Recover After a Downtime

Every time a deployment happens, there’s always the possibility of causing downtime in the application. The reasons behind downtime vary, like missing test cases or a typo in the number of servers that have to be updated with the new code. Having 100 percent availability for a service is expensive, and it isn’t worth it. Instead of trying to avoid all downtime, focus on recovering as fast as you can when something goes wrong. Start by setting a baseline that defines how the system looks when it’s healthy. Then take action every time something gets outside the parameters of that “normal” state. For example, you might define a service level objective (SLO) to have a latency under 500ms. If, after a deployment, the latency of the system turns out to be above the SLO, you might want to roll back immediately. Ideally, if you’ve been able to improve the deployment frequency and it doesn’t take too much time for a deployment to finish, the mean time to recover should be small. I mean, it’s just about doing another deployment, ordinary business by this point. It could be as simple as changing the artifact version to a previous one. Remember, the time it takes to deploy a change indicates how hard or easy it is to make a change in the application. The best way to keep the mean time to recover small is by practicing the rollback mechanism. When everything is under version control, it gets easier: it’s just about going back to the previous version and deploying again.
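The two ideas in this section, checking an SLO after a deployment and measuring the mean time to recover, can be sketched as follows. The 500 ms threshold comes from the example above; comparing the mean of a few latency samples against it is an assumption made here for simplicity (you might prefer a percentile), and the function names and incident data are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean

LATENCY_SLO_MS = 500  # example SLO from the text: latency under 500 ms


def should_roll_back(latency_samples_ms, slo_ms=LATENCY_SLO_MS):
    """Assumed rule: roll back when mean post-deployment latency exceeds the SLO."""
    return mean(latency_samples_ms) > slo_ms


def mean_time_to_recover(incidents):
    """Average of (recovered - detected) across incidents, as a timedelta."""
    total = sum((up - down for down, up in incidents), timedelta())
    return total / len(incidents)


# Hypothetical (detected, recovered) timestamps for two incidents.
incidents = [
    (datetime(2019, 2, 5, 15, 45), datetime(2019, 2, 5, 15, 57)),
    (datetime(2019, 2, 13, 11, 0), datetime(2019, 2, 13, 11, 18)),
]

mttr = mean_time_to_recover(incidents)
print(f"Roll back? {should_roll_back([620, 540, 710])}")
print(f"Mean time to recover: {mttr}")
```

If the rollback is just another deployment, the mean time to recover should stay close to your normal deployment time plus however long it takes to notice the problem.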

The Number of Deployments on Schedule

It doesn’t matter if you’re able to ship application changes rapidly if they’re not live in production on time. The number of on-schedule deployments is a good metric to motivate the team to improve the deployment pipeline. Also, anyone who’s interested in knowing more about why they need to introduce automation or infrastructure as code will find this metric helpful. I’ve worked on projects where we needed to change many things in every deployment. A few of the changes were done manually. There were even times when unexpected problems arose in a testing environment. These problems had a significant impact on the testing phase of the application. We lacked homogeneous environments, so something could always go wrong when pushing the application changes to the next environment. Having deployments that are done on schedule is a sign that your deployment pipeline is in good condition. You deploy frequently, it takes just minutes, and more importantly, you have steady quality.
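A minimal sketch of the on-schedule metric, assuming you record a planned date and an actual go-live date for each deployment. The field names and sample data are invented for illustration.

```python
from datetime import date

# Hypothetical planned vs. actual go-live dates (sample data).
releases = [
    {"planned": date(2019, 2, 4), "actual": date(2019, 2, 4)},
    {"planned": date(2019, 2, 11), "actual": date(2019, 2, 13)},
    {"planned": date(2019, 2, 18), "actual": date(2019, 2, 18)},
    {"planned": date(2019, 2, 25), "actual": date(2019, 2, 22)},
]


def on_schedule_rate(items):
    """Fraction of deployments that went live on or before the planned date."""
    on_time = sum(1 for item in items if item["actual"] <= item["planned"])
    return on_time / len(items)


rate = on_schedule_rate(releases)
print(f"On-schedule deployments: {rate:.0%}")
```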

How Are Your Deployment Metrics Doing?

Before you start using these metrics to identify waste and make deployments more efficient, you need to know your current state. What are the current numbers for these types of metrics in your workloads? If you don’t know what the current state is, you’ll end up improving processes that don’t add value. Using these deployment metrics will help you identify where there’s waste in your value stream. You’ll have a clear view of where your efforts should be directed to improve speed and quality when shipping new features.

Author: Christian Meléndez. Christian is a technologist who started as a software developer and has more recently become a cloud architect focused on implementing continuous delivery pipelines with applications in several flavors, including .NET, Node.js, and Java, often using Docker containers.
